A PROXIMAL METHOD FOR COMPOSITE MINIMIZATION 
Abstract. We consider minimization of functions that are compositions of prox-regular functions
with smooth vector functions. A wide variety of important optimization problems can be formulated 
in this way. We describe a subproblem constructed from a linearized approximation to the objective 
and a regularization term, investigating the properties of local solutions of this subproblem and 
showing that they eventually identify a manifold containing the solution of the original problem. We 
propose an algorithmic framework based on this subproblem and prove a global convergence result. 
1. Problem Statement. We consider minimization problems of the form

$$\min_x \; h\bigl(c(x)\bigr), \tag{1.1}$$

where the inner function $c : \mathbb{R}^n \to \mathbb{R}^m$ is smooth. On the other hand, the outer
function $h : \mathbb{R}^m \to [-\infty, +\infty]$ may be nonsmooth, but is usually convex, and in
some way structured: it is often even polyhedral. Assuming that h is sufficiently 
well-structured to allow us to solve, relatively easily, subproblems of the form 

$$\min_d \; h\bigl(\Phi(d)\bigr) + \mu |d|^2, \tag{1.2}$$

for affine maps $\Phi$ and scalars $\mu > 0$ (where $|\cdot|$ denotes the Euclidean norm throughout
the paper), we design and analyze a "proximal" method for the problem (1.1).
More precisely, we consider an algorithmic framework in which a proximal linearized
subproblem of the form (1.2) is solved at each iteration to define a first approximation
to a step. If the function h is sufficiently well-structured — an assumption we make 
concrete using "partial smoothness," a generalization of the idea of an active set in 
nonlinear programming — we may then be able to enhance the step, possibly with the 
use of higher-order derivative information. 

Many important problems in the form (1.1) involve finite convex functions $h$.
Our development explores, nonetheless, to what extent the underlying theory for the 
proposed algorithm extends to more general functions. Specifically, we broaden the 
class of allowable functions h in two directions: 

• $h$ may be extended-valued, allowing constraints that must be enforced;

• we weaken the requirement of convexity to "prox-regularity".

This broader framework involves extra technical overhead, but we point out through- 
out how the development simplifies in the case of continuous convex h, and in partic- 
ular polyhedral h. 

Let us fix some notation. We consider a local solution (or, more generally, critical
point) $\bar{x}$ for the problem (1.1), and let $\bar{c} := c(\bar{x})$. (Our assumption that the function $c$
is everywhere defined is primarily for notational simplicity: restricting our analysis to
a neighborhood of $\bar{x}$ is straightforward.) The criticality condition is $0 \in \partial (h \circ c)(\bar{x})$,
where $\partial$ denotes the subdifferential. As we discuss below, under reasonable conditions,
a chain rule then implies existence of a vector $\bar{v}$ such that

$$\bar{v} \in \partial h(\bar{c}) \cap \mathrm{Null}\bigl(\nabla c(\bar{x})^*\bigr), \tag{1.3}$$

where $\nabla c(\bar{x}) : \mathbb{R}^n \to \mathbb{R}^m$ is the derivative of $c$ at $\bar{x}$ and $^*$ denotes the adjoint map.
In typical examples, we can interpret the vector(s) $\bar{v}$ as Lagrange multipliers, as we
discuss below.

We prove results of three types: 

1. When the current point $x$ is near the critical point $\bar{x}$, the proximal linearized
subproblem (1.2) has a local solution $d$ of size $O(|x - \bar{x}|)$. By projecting the
point $x + d$ onto the inverse image under the map $c$ of the domain of the
function $h$, we can obtain a step that reduces the objective (1.1).

2. Under reasonable conditions, when $x$ is close to $\bar{x}$, if $h$ is "partly smooth" at
$\bar{c}$ relative to a certain manifold $\mathcal{M}$ (a generalization of the surface defined by
the active constraints in classical nonlinear programming), then the algorithm
"identifies" $\mathcal{M}$: the solution $d$ of the subproblem (1.2) has $\Phi(d) \in \mathcal{M}$.

3. A global convergence result for an algorithm based on (1.2).

1.1. Definitions. We begin with some important definitions. We write $\bar{\mathbb{R}}$ for
the extended reals $[-\infty, +\infty]$, and consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$. The notion
of the subdifferential of $h$ at a point $c \in \mathbb{R}^m$, denoted $\partial h(c)$, provides a powerful
unification of the classical gradient of a smooth function and the subdifferential from
convex analysis. It is a set of generalized gradient vectors, coinciding exactly with the
classical convex subdifferential [33] when $h$ is lower semicontinuous and convex, and
equalling $\{\nabla h(c)\}$ when $h$ is $C^1$ around $c$. For the formal definition, and others from
variational analysis, the texts [34] and [27] are good sources.

An elegant framework for unifying smooth and convex analysis is furnished by
the notion of "prox-regularity" [29]. Geometrically, the idea is rather natural: a set
$S \subset \mathbb{R}^m$ is prox-regular at a point $s \in S$ if every point near $s$ has a unique nearest point
in $S$ (using the Euclidean distance). In particular, closed convex sets are prox-regular
at every point. A finite collection of equality and inequality constraints defines a
set that is prox-regular at any point where the gradients of the active constraints are
linearly independent.

A function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ is prox-regular at a point $\bar{c}$ if $h(\bar{c})$ is finite and the epigraph

$$\mathrm{epi}\, h := \{(c, r) \in \mathbb{R}^m \times \mathbb{R} : r \ge h(c)\}$$

is prox-regular at the point $(\bar{c}, h(\bar{c}))$. In particular, both convex and $C^2$ functions are
prox-regular wherever they are defined.

A general class of prox-regular functions common in engineering applications consists of the
"lower $C^2$" functions (see Rockafellar and Wets [34]). A function $h : \mathbb{R}^m \to \mathbb{R}$ is lower $C^2$ around
a point $\bar{c} \in \mathbb{R}^m$ if $h$ has the local representation

$$h(c) = \max_{t \in T} f(c, t) \quad \text{for } c \in \mathbb{R}^m \text{ near } \bar{c},$$

for some function $f : \mathbb{R}^m \times T \to \mathbb{R}$, where the space $T$ is compact and the quantities
$f(c, t)$, $\nabla_c f(c, t)$, and $\nabla^2_{cc} f(c, t)$ all depend continuously on $(c, t)$. A simple equivalent
property, useful in theory though harder to check in practice, is that $h$ has the form
$g - \kappa |\cdot|^2$ around the point $\bar{c}$ for some continuous convex function $g$ and some constant
$\kappa$.

The original definition of prox-regularity given by Poliquin and Rockafellar [29]
involved the subdifferential, as follows. For the equivalence with the geometric
definition above, see Poliquin, Rockafellar, and Thibault [30].

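Definition 1.1. A function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ is prox-regular at a point $\bar{c} \in \mathbb{R}^m$ for a
subgradient $\bar{v} \in \partial h(\bar{c})$ if $h$ is finite at $\bar{c}$, locally lower semicontinuous around $\bar{c}$, and
there exists $\rho > 0$ such that

$$h(c') \ge h(c) + \langle v, c' - c \rangle - \frac{\rho}{2} |c' - c|^2$$

whenever points $c, c' \in \mathbb{R}^m$ are near $\bar{c}$ with the value $h(c)$ near the value $h(\bar{c})$ and for
every subgradient $v \in \partial h(c)$ near $\bar{v}$. Further, $h$ is prox-regular at $\bar{c}$ if it is prox-regular
at $\bar{c}$ for every $\bar{v} \in \partial h(\bar{c})$.

Note in particular that if $h$ is prox-regular at $\bar{c}$, we have that, for every $\bar{v} \in \partial h(\bar{c})$,
there exists $\rho > 0$ such that

$$h(c') \ge h(\bar{c}) + \langle \bar{v}, c' - \bar{c} \rangle - \frac{\rho}{2} |c' - \bar{c}|^2 \tag{1.4}$$

whenever $c'$ is near $\bar{c}$. (Set $c = \bar{c}$ in the definition above.)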

A weaker property than the prox-regularity of a function $h$ is "subdifferential
regularity," a concept easiest to define in the case in which $h$ is Lipschitz. In this
case, $h$ is almost everywhere differentiable: it is subdifferentially regular at a point
$\bar{c} \in \mathbb{R}^m$ if its classical directional derivative for every direction $d \in \mathbb{R}^m$ equals

$$\limsup_{c \to \bar{c}} \, \langle \nabla h(c), d \rangle,$$

where the lim sup is taken over points $c$ where $h$ is differentiable. Clearly, $C^1$ functions
have this property; continuous convex functions also have it. For nonlipschitz
functions the notion is less immediate to define (see Rockafellar and Wets [34]), but
it holds for lower semicontinuous, convex functions (see [34, Example 7.27]) and more
generally for prox-regular functions.

We next turn to the idea of "partial smoothness" introduced by Lewis [22], a
variational-analytic formalization of the notion of the active set in classical nonlinear
programming. The notion we describe here is, more precisely, "$C^2$-partial smoothness":
see Hare and Lewis [16, Definition 2.3]. In the definition below, a set $\mathcal{M} \subset \mathbb{R}^m$
is a manifold about a point $\bar{c} \in \mathcal{M}$ if it can be described locally by a collection of
smooth equations with linearly independent gradients: more precisely, there exists a
map $F : \mathbb{R}^m \to \mathbb{R}^k$ that is $C^2$ around $\bar{c}$ with $\nabla F(\bar{c})$ surjective and such that points
$c \in \mathbb{R}^m$ near $\bar{c}$ lie in $\mathcal{M}$ if and only if $F(c) = 0$. The classical normal space to $\mathcal{M}$ at
$\bar{c}$, denoted $N_{\mathcal{M}}(\bar{c})$, is then just the range of $\nabla F(\bar{c})^*$.

Definition 1.2. A function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ is partly smooth at a point $\bar{c} \in \mathbb{R}^m$
relative to a set $\mathcal{M} \subset \mathbb{R}^m$ containing $\bar{c}$ if $\mathcal{M}$ is a manifold about $\bar{c}$ and the following
properties hold:

(i) (Smoothness) The restricted function $h|_{\mathcal{M}}$ is $C^2$ near $\bar{c}$;

(ii) (Regularity) $h$ is subdifferentially regular at all points $c \in \mathcal{M}$ near $\bar{c}$, with
$\partial h(c) \ne \emptyset$;

(iii) (Sharpness) The affine span of $\partial h(\bar{c})$ is a translate of $N_{\mathcal{M}}(\bar{c})$;

(iv) (Sub-continuity) The set-valued mapping $\partial h : \mathcal{M} \rightrightarrows \mathbb{R}^m$ is continuous at $\bar{c}$.
We refer to $\mathcal{M}$ as the active manifold.

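A set $S \subset \mathbb{R}^m$ is partly smooth at a point $\bar{c} \in S$ relative to a manifold $\mathcal{M}$ if its
indicator function,

$$\delta_S(c) = \begin{cases} 0 & (c \in S) \\ +\infty & (c \notin S), \end{cases}$$

is partly smooth at $\bar{c}$ relative to $\mathcal{M}$. Again we refer to $\mathcal{M}$ as the active manifold.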

We denote by $P_S(v)$ the usual Euclidean projection of a vector $v \in \mathbb{R}^m$ onto a
closed set $S \subset \mathbb{R}^m$. The distance between $x$ and the set $S$ is

$$\mathrm{dist}(x, S) = \inf_{y \in S} |x - y|.$$

We use $B_\epsilon(x)$ to denote the closed Euclidean ball of radius $\epsilon$ around a point $x$.

2. Examples. The framework (1.1) admits a wide variety of interesting problems,
as we show in this section.

2.1. Approximation Problems. 

Example 2.1 (least squares, $\ell_1$, and Huber approximation). The formulation (1.1)
encompasses both the usual (nonlinear) least squares problem if we define $h(\cdot) = |\cdot|^2$,
and the $\ell_1$ approximation problem if we define $h(\cdot) = |\cdot|_1$, the $\ell_1$-norm. Another
popular robust loss function is the Huber function defined by $h(c) = \sum_{i=1}^m \phi(c_i)$, where

$$\phi(c_i) = \begin{cases} \tfrac{1}{2} c_i^2 & (|c_i| \le T) \\ T |c_i| - \tfrac{1}{2} T^2 & (|c_i| > T). \end{cases}$$
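As an illustration only, the Huber function of Example 2.1 can be evaluated with a few lines of Python. This is a minimal sketch assuming the piecewise form $\phi(c_i) = \tfrac{1}{2} c_i^2$ for $|c_i| \le T$ and $T|c_i| - \tfrac{1}{2}T^2$ otherwise; the function name `huber` and the argument names are hypothetical and not taken from the paper.

```python
def huber(c, T):
    """Huber loss h(c) = sum_i phi(c_i) with threshold T > 0 (illustrative sketch)."""
    total = 0.0
    for ci in c:
        a = abs(ci)
        if a <= T:
            total += 0.5 * ci * ci        # quadratic piece near zero
        else:
            total += T * a - 0.5 * T * T  # linear piece in the tails
    return total


# Example: one residual inside the threshold, one outside.
value = huber([0.5, -3.0], 1.0)
```

The two pieces agree at $|c_i| = T$, so the loss is continuous (indeed $C^1$) across the threshold.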



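Example 2.2 (sum of Euclidean norms). Given a collection of smooth vector
functions $g_i : \mathbb{R}^n \to \mathbb{R}^{m_i}$, for $i = 1, 2, \dots, t$, consider the problem

$$\min_x \; \sum_{i=1}^t |g_i(x)|.$$

We can place such problems in the form (1.1) by defining the smooth vector function
$c : \mathbb{R}^n \to \mathbb{R}^{m_1} \times \mathbb{R}^{m_2} \times \cdots \times \mathbb{R}^{m_t}$ by $c = (g_1, g_2, \dots, g_t)$, and the nonsmooth function
$h : \mathbb{R}^{m_1} \times \cdots \times \mathbb{R}^{m_t} \to \mathbb{R}$ by

$$h(g_1, g_2, \dots, g_t) = \sum_{i=1}^t |g_i|.$$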



2.2. Problems from Nonlinear Programming. Next, we consider examples 
motivated by penalty functions for nonlinear programming. 

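Example 2.3 ($\ell_1$ penalty function). Consider the following nonlinear program:

$$\begin{aligned} \min \quad & f(x) \\ \text{subject to} \quad & g_i(x) = 0 \quad (1 \le i \le j), \\ & g_i(x) \le 0 \quad (j < i \le k), \\ & x \in X, \end{aligned} \tag{2.1}$$

where the polyhedron $X \subset \mathbb{R}^n$ describes constraints on the variable $x$ that are easy to
handle directly. The $\ell_1$ penalty function formulation is

$$\min_{x \in X} \; f(x) + \nu \sum_{i=1}^{j} |g_i(x)| + \nu \sum_{i=j+1}^{k} \max\bigl(0, g_i(x)\bigr), \tag{2.2}$$

where $\nu > 0$ is a scalar parameter. We can express this problem in the form (1.1) by
defining the smooth vector function

$$c(x) = \bigl(f(x), \, (g_i(x))_{i=1}^k, \, x\bigr) \in \mathbb{R} \times \mathbb{R}^k \times \mathbb{R}^n$$

and the extended polyhedral convex function $h : \mathbb{R} \times \mathbb{R}^k \times \mathbb{R}^n \to \bar{\mathbb{R}}$ by

$$h(f, g, x) = \begin{cases} f + \nu \sum_{i=1}^{j} |g_i| + \nu \sum_{i=j+1}^{k} \max(0, g_i) & (x \in X) \\ +\infty & (x \notin X). \end{cases}$$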
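To make the outer function of Example 2.3 concrete, the following is a small sketch of how $h(f, g, x)$ might be evaluated. It assumes the first $j$ entries of $g$ are the equality-constraint values and the remainder are inequality-constraint values; the name `l1_penalty` and the boolean flag `in_X` (standing in for the membership test $x \in X$) are hypothetical.

```python
import math


def l1_penalty(f_val, g_vals, j, nu, in_X=True):
    """Evaluate the l1 penalty outer function h(f, g, x) (illustrative sketch).

    g_vals[:j] are equality-constraint values, g_vals[j:] inequality values;
    in_X is an assumed flag reporting whether x lies in the polyhedron X.
    """
    if not in_X:
        return math.inf                               # extended-valued branch
    eq = sum(abs(g) for g in g_vals[:j])              # sum of |g_i| over equalities
    ineq = sum(max(0.0, g) for g in g_vals[j:])       # positive parts over inequalities
    return f_val + nu * (eq + ineq)


# Example: one violated equality, one satisfied and one violated inequality.
value = l1_penalty(1.0, [0.5, -2.0, 3.0], j=1, nu=2.0)
```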



A generalization of Example 2.3 in which $h$ is a finite polyhedral function was the
focus of much research in the 1980s. We consider this case further in Section 3 and
use it again during the paper to illustrate the theory that we develop.

2.3. Regularized Minimization Problems. A large family of instances of
(1.1) arises in the area of regularized minimization, where the minimization problem
has the following general form:

$$\min_x \; f(x) + \tau |x|_*, \tag{2.3}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth objective, while $|\cdot|_*$ is a continuous, nonnegative,
usually nonsmooth function, and $\tau$ is a nonnegative regularization parameter. Such
formulations arise when we seek an approximate minimizer of $f$ that is "simple" in
some sense; the purpose of the second term is to promote this simplicity property.
Larger values of $\tau$ tend to produce solutions $x$ that are simpler, but less accurate as
minimizers of $f$. The problem (2.3) can be put into the framework (1.1) by defining

$$c(x) = \begin{bmatrix} f(x) \\ x \end{bmatrix} \in \mathbb{R}^{n+1}, \qquad h(f, x) = f + \tau |x|_*. \tag{2.4}$$

We list now some interesting cases of (2.3).

Example 2.4 ($\ell_1$-regularized minimization). The choice $|\cdot|_* = |\cdot|_1$ in (2.3) tends
to produce solutions $x$ that are sparse, in the sense of having relatively few nonzero
components. Larger values of $\tau$ tend to produce sparser solutions. Compressed sensing
is a particular area of current interest, in which the objective $f$ is typically a least-squares
function $f(x) = (1/2) |Ax - b|^2$; see [6] for a recent survey. Regularized
least-squares problems (or equivalent constrained-optimization formulations) are also
encountered in statistics; see for example the LASSO [38] and LARS [12] procedures,
and basis pursuit [7].

A related application is regularized logistic regression, where again $|\cdot|_* = |\cdot|_1$, but
$f$ is (the negative of) an a posteriori log likelihood function. In the setup of [36], $x$
contains the coefficients of a basis expansion of a log-odds ratio function, where each
basis function is a function of the feature vector. The objective $f$ is the (negative) log
likelihood function obtained by matching this data to a set of binary labels. In this
case, $f$ is convex but highly nonlinear. The regularization term causes the solution to
have few nonzero coefficients, so the formulation identifies the most important basis
functions for predicting the observed labels.

Another interesting class of regularized minimization problems arises in matrix
completion, where we seek an $m \times n$ matrix $X$ of smallest rank that is consistent
with given knowledge of various linear combinations of the elements of $X$; see [5, 31,
11]. Much as the $\ell_1$ norm of a vector $x$ is used as a surrogate for cardinality of $x$ in the
formulations of Example 2.4, the nuclear norm is used as a surrogate for the rank
of $X$ in formulations of the matrix completion problem. The nuclear norm $|X|_*$ is
defined as the sum of singular values of $X$, and we have the following specialization
of (2.3):

$$\min_X \; \frac{1}{2} |A(X) - b|^2 + \tau |X|_*, \tag{2.5}$$

where $A$ denotes a linear operator from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^p$, and $b \in \mathbb{R}^p$ is the observation
vector. Note that the nuclear norm is a continuous and convex function of $X$.

2.4. Nonconvex Problems. Each of the examples above involves a convex 
outer function h. In principle, however, the techniques we develop here also apply to 
a variety of nonconvex functions. The next example includes some simple illustrations. 

Example 2.5 (problems involving quadratics). Given a general quadratic function
$f : \mathbb{R}^p \to \mathbb{R}$ (possibly nonconvex) and a smooth function $c_1 : \mathbb{R}^n \to \mathbb{R}^p$, consider
the problem $\min_x f(c_1(x))$. This problem trivially fits into the framework (1.1), and
the function $f$, being $C^2$, is everywhere prox-regular. The subproblems (1.2), for
sufficiently large values of the parameter $\mu$, simply amount to solving a linear system.

More generally, given another general quadratic function $g : \mathbb{R}^q \to \mathbb{R}$, and another
smooth function $c_2 : \mathbb{R}^n \to \mathbb{R}^q$, consider the problem

$$\min_{x \in \mathbb{R}^n} \; f\bigl(c_1(x)\bigr) \quad \text{subject to } g\bigl(c_2(x)\bigr) \le 0.$$

We can express this problem in the form (1.1) by defining the smooth vector function
$c = (c_1, c_2)$ and defining an extended-valued nonconvex function

$$h(c_1, c_2) = \begin{cases} f(c_1) & (g(c_2) \le 0) \\ +\infty & (g(c_2) > 0). \end{cases}$$

The epigraph of $h$ is

$$\{(c_1, c_2, t) : g(c_2) \le 0, \; t \ge f(c_1)\},$$

a set defined by two smooth inequality constraints: hence $h$ is prox-regular at any point
$(c_1, c_2)$ satisfying $g(c_2) \le 0$ and $\nabla g(c_2) \ne 0$. The resulting subproblems (1.2) are all in
the form of the standard trust-region subproblem, and hence relatively straightforward
to solve quickly.

As one more example, consider the case when the outer function $h$ is defined
as the maximum of a finite collection of quadratic functions (possibly nonconvex):
$h(x) = \max\{f_i(x) : i = 1, 2, \dots, k\}$. We can write the subproblems (1.2) in the form

$$\min \bigl\{ t : t \ge f_i(\Phi(d)) + \mu |d|^2, \; d \in \mathbb{R}^n, \; t \in \mathbb{R}, \; i = 1, 2, \dots, k \bigr\},$$

where the map $\Phi$ is affine. For sufficiently large values of the parameter $\mu$, this
quadratically-constrained convex quadratic program can in principle be solved
efficiently by an interior point method.

To conclude, we consider two more applied nonconvex examples. The first is due
to Mangasarian [23] and is used by Jokar and Pfetsch [18] to find sparse solutions of
underdetermined linear equations. The formulation of [18] can be stated in the form
(2.3) where the regularization function $|\cdot|_*$ has the form

$$|x|_* = \sum_{i=1}^n \bigl(1 - e^{-\alpha |x_i|}\bigr)$$

for some parameter $\alpha > 0$. It is easy to see that this function is nonconvex but
prox-regular, and nonsmooth only at $x_i = 0$.

Zhang et al. [43] use a similar regularization function of the form (2.3) that behaves
like the $\ell_1$ norm near the origin and transitions (via a concave quadratic) to a constant
for large loss values. Specifically, we have $|\cdot|_* = \sum_{i=1}^n \phi(x_i)$, where

$$\phi(x_i) = \begin{cases} \lambda |x_i| & (|x_i| \le \lambda) \\ -(x_i^2 - 2a\lambda |x_i| + \lambda^2) / (2(a - 1)) & (\lambda < |x_i| \le a\lambda) \\ (a + 1)\lambda^2 / 2 & (|x_i| > a\lambda). \end{cases}$$

Here $\lambda > 0$ and $a > 1$ are tuning parameters.
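The piecewise regularizer just described can be sketched in Python as follows. This is an illustration under the assumption that the three pieces are $\lambda|x_i|$ for $|x_i| \le \lambda$, the concave quadratic $-(x_i^2 - 2a\lambda|x_i| + \lambda^2)/(2(a-1))$ for $\lambda < |x_i| \le a\lambda$, and the constant $(a+1)\lambda^2/2$ beyond; the function name `phi` and its argument names are hypothetical.

```python
def phi(x, lam, a):
    """Scalar nonconvex penalty: l1-like near 0, constant for large |x| (sketch)."""
    ax = abs(x)
    if ax <= lam:
        return lam * ax                                   # l1 piece near the origin
    if ax <= a * lam:
        # concave quadratic transition between the two regimes
        return -(ax * ax - 2.0 * a * lam * ax + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0                    # constant piece


def penalty(xs, lam, a):
    """Separable regularizer |x|_* = sum_i phi(x_i)."""
    return sum(phi(x, lam, a) for x in xs)
```

With $\lambda = 1$ and $a = 3$, the pieces match at the breakpoints $|x_i| = 1$ and $|x_i| = 3$, which is a quick sanity check on the assumed formula.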

3. The Finite Polyhedral Case. As we have remarked, a classical example of
our model problem $\min_x h(c(x))$ is the case in which the outer function $h$ is finite and
polyhedral:

$$h(c) = \max_{i \in I} \{\langle h_i, c \rangle + \beta_i\} \tag{3.1}$$

for some given vectors $h_i \in \mathbb{R}^m$ and scalars $\beta_i$, where the index $i$ runs over some finite
set $I$. We use this case to illustrate much of the basic theory we develop.

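Assume the map $c : \mathbb{R}^n \to \mathbb{R}^m$ is $C^1$ around a critical point $\bar{x} \in \mathbb{R}^n$ for the
composite function $h \circ c$, and let $\bar{c} = c(\bar{x})$. Define the set of "active" indices

$$\bar{I} = \operatorname{argmax} \{\langle h_i, \bar{c} \rangle + \beta_i : i \in I\}.$$

Then, denoting convex hulls by conv, we have

$$\partial h(\bar{c}) = \operatorname{conv}\{h_i : i \in \bar{I}\}.$$

Hence the basic criticality condition (1.3) becomes the existence of a vector $\lambda \in \mathbb{R}^{\bar{I}}$
satisfying

$$\lambda \ge 0 \quad \text{and} \quad \sum_{i \in \bar{I}} \lambda_i \begin{bmatrix} \nabla c(\bar{x})^* h_i \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{3.2}$$

The vector $\bar{v}$ is then $\sum_{i \in \bar{I}} \lambda_i h_i$.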

Compare this with the classical nonlinear programming framework, which is

$$\begin{aligned} \min \quad & -t \\ \text{subject to} \quad & \langle h_i, c(x) \rangle + \beta_i + t \le 0 \quad (i \in I) \\ & (x, t) \in \mathbb{R}^n \times \mathbb{R}. \end{aligned} \tag{3.3}$$

At the point $(\bar{x}, -h(\bar{c}))$, the conditions (3.2) are just the standard first-order
optimality conditions, with Lagrange multipliers $\lambda_i$. The fact that the vector $\bar{v}$ in the
criticality condition (1.3) is closely identified with $\lambda$ via the relationship $\bar{v} = \sum_{i \in \bar{I}} \lambda_i h_i$
motivates our terminology "multiplier vector".
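For a finite polyhedral function $h(c) = \max_i \{\langle h_i, c \rangle + \beta_i\}$, the value and the active index set are easy to compute. The sketch below is illustrative only: the function name, the list-of-lists representation `H` for the vectors $h_i$, and the tolerance `tol` used to decide which pieces count as active are all assumptions, not part of the paper.

```python
def polyhedral_value_and_active(H, beta, c, tol=1e-12):
    """Return h(c) = max_i (<h_i, c> + beta_i) and the indices attaining the max.

    H is a list of vectors h_i, beta a list of scalars beta_i (assumed layout).
    """
    vals = [sum(hj * cj for hj, cj in zip(h, c)) + b for h, b in zip(H, beta)]
    hval = max(vals)
    active = [i for i, v in enumerate(vals) if v >= hval - tol]
    return hval, active


# Example: h(c) = |c| on the real line, written as max(c, -c).
H = [[1.0], [-1.0]]
beta = [0.0, 0.0]
at_kink = polyhedral_value_and_active(H, beta, [0.0])   # both pieces active
away = polyhedral_value_and_active(H, beta, [2.0])      # only the first piece active
```

At the kink both pieces are active, so the subdifferential is the convex hull of $\{1, -1\}$, that is, the interval $[-1, 1]$.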

We return to this example repeatedly in what follows. 

4. The Proximal Linearized Subproblem. In this work we consider an
algorithmic framework based on solution of a proximal linearized subproblem of the
form (1.2) at each iteration. We focus on the case in which $\Phi(d)$ is the Taylor-series
linearization of $c$ around the current iterate $x$, which yields the following subproblem:

$$\min_d \; h_{x,\mu}(d) := h\bigl(c(x) + \nabla c(x) d\bigr) + \frac{\mu}{2} |d|^2, \tag{4.1}$$

where $\mu > 0$ is a parameter and the linear map $\nabla c(x) : \mathbb{R}^n \to \mathbb{R}^m$ is the derivative of
the map $c$ at $x$ (representable by the $m \times n$ Jacobian matrix).

For simplicity, consider first a function $h : \mathbb{R}^m \to (-\infty, +\infty]$ that is convex and
lower semicontinuous. Assuming that the vector $c(x) + \nabla c(x) d$ lies in the domain of
$h$ for some step $d \in \mathbb{R}^n$, the subproblem (4.1) involves minimizing a strictly convex
function with nonempty compact level sets, and thus has a unique solution $d = d(x)$.
If we assume slightly more, namely that $c(x) + \nabla c(x) d$ lies in the relative interior of the domain
of $h$ for some $d$ (as holds obviously if $h$ is continuous at $c(x)$), a standard chain rule
from convex analysis implies that $d = d(x)$ is the unique solution of the following
inclusion:

$$\nabla c(x)^* v + \mu d = 0, \quad \text{for some } v \in \partial h\bigl(c(x) + \nabla c(x) d\bigr). \tag{4.2}$$

When $h$ is just prox-regular rather than convex, under reasonable conditions (see
below), the subproblem (4.1) still has a unique local solution close to zero, for $\mu$
sufficiently large, which is characterized by property (4.2).

For regularized minimization problems of the form (2.3), the subproblem (4.1)
has the form

$$\min_d \; f(x) + \langle \nabla f(x), d \rangle + \frac{\mu}{2} |d|^2 + \tau |x + d|_*. \tag{4.3}$$

An equivalent formulation can be obtained by shifting the objective and making the
change of variable $z := x + d$:

$$\min_z \; \frac{\mu}{2} |z - y|^2 + \tau |z|_*, \quad \text{where } y = x - \frac{1}{\mu} \nabla f(x). \tag{4.4}$$

When the regularization function $|\cdot|_*$ is separable in the components of $x$, as when
$|\cdot|_* = |\cdot|_1$ or $|\cdot|_* = |\cdot|_2^2$, this problem can be solved in $O(n)$ operations. (Indeed,
this fact is key to the efficiency of methods based on these subproblems in compressed
sensing applications.) For the case $|\cdot|_* = |\cdot|_1$, if we set $\alpha = \tau/\mu$, the solution of (4.4)
is

$$z_i = \begin{cases} 0 & (|y_i| \le \alpha) \\ y_i - \alpha & (y_i > \alpha) \\ y_i + \alpha & (y_i < -\alpha). \end{cases} \tag{4.5}$$

The operation specified by (4.5) is known commonly as the "shrink operator."
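The shrink operator and the resulting proximal linearized step for the $\ell_1$ case can be sketched in a few lines. This is a minimal illustration, assuming the closed form $z_i = 0$ for $|y_i| \le \alpha$ and $z_i = y_i \mp \alpha$ otherwise, with $y = x - \nabla f(x)/\mu$ and $\alpha = \tau/\mu$; the function names `shrink` and `prox_linear_l1_step` are hypothetical.

```python
def shrink(y, alpha):
    """Componentwise soft-thresholding of the list y with threshold alpha >= 0."""
    z = []
    for yi in y:
        if yi > alpha:
            z.append(yi - alpha)
        elif yi < -alpha:
            z.append(yi + alpha)
        else:
            z.append(0.0)          # small entries are set exactly to zero
    return z


def prox_linear_l1_step(x, grad, mu, tau):
    """New point z = x + d for the l1-regularized subproblem, in O(n) work."""
    y = [xi - gi / mu for xi, gi in zip(x, grad)]   # gradient step scaled by 1/mu
    return shrink(y, tau / mu)


# Example: the second component is driven exactly to zero.
z = prox_linear_l1_step([1.0, 0.0], [-2.0, 0.5], mu=2.0, tau=1.0)
```

The exact zeros produced by the thresholding are what make the iterates sparse in compressed sensing applications.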
For the matrix completion formulation (2.5), the formulation (4.4) of the
subproblem becomes

$$\min_{Z \in \mathbb{R}^{m \times n}} \; \frac{\mu}{2} |Z - Y|_F^2 + \tau |Z|_*, \tag{4.6}$$

where $|\cdot|_F$ denotes the Frobenius norm of a matrix and

$$Y = X - \frac{1}{\mu} A^* [A(X) - b]. \tag{4.7}$$

It is known (see for example [4]) that (4.6) can be solved by using the singular-value
decomposition of $Y$. Writing $Y = U \Sigma V^T$, where $U$ and $V$ are orthogonal and
$\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_{\min(m,n)})$, we have $Z = U \Sigma_{\tau/\mu} V^T$, where the diagonals of $\Sigma_{\tau/\mu}$
are $\max(\sigma_i - \tau/\mu, 0)$ for $i = 1, 2, \dots, \min(m, n)$. In essence, we apply the shrink
operator to the singular values of $Y$, and reconstruct $Z$ by using the orthogonal
matrices $U$ and $V$ from the decomposition of $Y$.
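The singular-value shrinkage just described might be implemented as in the following sketch. It assumes NumPy is available and uses `numpy.linalg.svd`; the function name `svt` and the argument name `alpha` (standing for $\tau/\mu$) are hypothetical.

```python
import numpy as np


def svt(Y, alpha):
    """Apply the shrink operator to the singular values of Y (illustrative sketch).

    Returns Z = U diag(max(sigma_i - alpha, 0)) V^T.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # thin SVD: Y = U diag(s) Vt
    s_shrunk = np.maximum(s - alpha, 0.0)              # threshold singular values
    return (U * s_shrunk) @ Vt                         # scale columns of U, recombine


# Example: singular values 3 and 1 with threshold 2 leave a rank-one matrix.
Z = svt(np.diag([3.0, 1.0]), 2.0)
```

Singular values below the threshold vanish, so the step tends to reduce rank in the same way that the vector shrink operator promotes sparsity.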

5. Related Work. We discuss here some connections of our approach with 
existing literature. 

We begin by considering various approaches when the outer function $h$ is finite
and polyhedral. One closely related work is by Fletcher and Sainz de la Maza [13],
who discuss an algorithm for minimization of the $\ell_1$-penalty function (2.2) for the
nonlinear optimization problem (2.1). The first step of their method at each iteration
is to solve a linearized trust-region problem which can be expressed in our general
notation as follows:

$$\min_d \; h\bigl(c(x) + \nabla c(x) d\bigr) \quad \text{subject to } |d| \le \rho, \tag{5.1}$$

where $\rho$ is some trust-region radius. Note that this subproblem is closely related to
our linearized subproblem (4.1) when the Euclidean norm is used to define the trust
region. However, the $\ell_\infty$ norm is preferred in [13], as the subproblem (5.1) can then
be expressed as a linear program. The algorithm in [13] uses the solution of (5.1)
to estimate the active constraint manifold, then computes a step that minimizes a
model of the Lagrangian function for (2.1) while fixing the identified constraints as
equalities. A result of active constraint identification is proved ([13, Theorem 2.3]);
this result is related to our Theorems 6.12 and 7.5 below.

Byrd et al. [3] describe a successive linear-quadratic programming method, based
on [13], which starts with solution of the linear program (5.1) (with $\ell_\infty$ trust region)
and uses it to define an approximate Cauchy point, then approximately solves an
equality-constrained quadratic program (EQP) over a different trust region to
enhance the step. This algorithm is implemented in the KNITRO package for nonlinear
optimization as the KNITRO-ACTIVE option.

Friedlander et al. [14] solve a problem of the form (4.1) for the case of nonlinear
programming, where $h$ is the sum of the objective function $f$ and the indicator function
for the equalities and the inequalities defining the feasible region. The resulting step
can be enhanced by solving an EQP.

Other related literature on composite nonsmooth optimization problems with
general finite polyhedral convex functions (Section 3) includes the papers of Yuan [41, 42]
and Wright [39]. The approaches in [42, 39] solve a linearized subproblem like (5.1),
from which an analog of the "Cauchy point" for trust-region methods in smooth
unconstrained optimization can be calculated. This calculation involves a line search
along a piecewise quadratic function and is therefore more complicated than the
calculation in [13], but serves a similar purpose, namely as the basis of an acceptability
test for a step obtained from a higher-order model.

For general outer functions $h$, the theory is more complex. An early approach
to regularized minimization problems of the form (2.3) for a lower semicontinuous
convex function $|\cdot|_*$ is due to Fukushima and Mine [15]: they calculate a trial step
at each iteration by solving the linearized problem (4.3).

The case when the map $c$ is simply the identity has a long history. The iteration
$x_{k+1} = x_k + d_k$, where $d_k$ minimizes the function $d \mapsto h(x_k + d) + \frac{\mu}{2} |d|^2$, is
the well-known proximal point method. For lower semicontinuous convex functions
$h$, convergence was proved by Martinet [24] and generalized by Rockafellar [32]. For
nonconvex $h$, a good survey up to 1998 is by Kaplan and Tichatschke [19]. Pennanen
[28] took an important step forward, showing in particular that if the graph of the
subdifferential $\partial h$ agrees locally with the graph of the inverse of a Lipschitz function
(a condition verifiable using second-order properties including prox-regularity; see
Levy [21, Cor. 3.2]), then the proximal point method converges linearly if started
nearby and with regularization parameter $\mu$ bounded away from zero. This result
was foreshadowed in much earlier work of Spingarn [37], who gave conditions
guaranteeing local linear convergence of the proximal point method for a function $h$ that is
the sum of a lower semicontinuous convex function and a $C^2$ function, conditions which
furthermore hold "generically" under perturbation by a linear function. Inexact
variants of Pennanen's approach are discussed by Iusem, Pennanen, and Svaiter [17] and
Combettes and Pennanen [8]. In this current work, we make no attempt to build on
this more sophisticated theory, preferring a more direct and self-contained approach.

The issue of identification of the face of a constraint set on which the solution
of a constrained optimization problem lies has been the focus of numerous other
works. Some papers show that the projection of the point $x - \sigma \nabla f(x)$ onto the
feasible set (for some fixed $\sigma > 0$) lies on the same face as the solution $\bar{x}$, under
certain nondegeneracy assumptions on the problem and geometric assumptions on
the feasible set. Identification of so-called quasi-polyhedral faces of convex sets was
described by Burke and Moré [2]. An extension to the nonconvex case is provided by
Burke [1], who considers algorithms that work with linearizations of the constraints
describing the feasible set. Wright [40] considers surfaces of a convex set that can be
parametrized by a smooth algebraic mapping, and shows how algorithms of gradient
projection type can identify such surfaces once the iterates are sufficiently close to a
solution. Lewis [22] and Hare and Lewis [16] extend these results to the nonconvex,
nonsmooth case by using concepts from nonsmooth analysis, including partly smooth
functions and prox-regularity. In their setting, the concept of an identifiable face of
a feasible set becomes a certain type of manifold with respect to which $h$ is partly
smooth (see Definition 1.2 above). Their main results give conditions under which
the active manifold is identified from within a neighborhood of the solution.

Another line of relevant work is associated with the $\mathcal{VU}$ theory introduced by
Lemaréchal, Oustry, and Sagastizábal [20] and subsequently elaborated by these and
other authors. The focus is on minimizing convex functions $f(x)$ that, again, are
partly smooth: smooth ("U-shaped") along a certain manifold through the solution
$\bar{x}$, but nonsmooth ("V-shaped") in the transverse directions. Mifflin and Sagastizábal
[35] discuss the "fast track," which is essentially the manifold containing the
solution $\bar{x}$ along which the objective is smooth. Similarly to [13], they are interested
in algorithms that identify the fast track and then take a minimization step for a
certain Lagrangian function along this track. It is proved in [35, Theorem 5.2] that
under certain assumptions, when $x$ is near $\bar{x}$, the proximal point $x + d$ obtained by
solving the problem

$$\min_d \; f(x + d) + \frac{\mu}{2} |d|^2 \tag{5.2}$$

lies on the fast track. This identification result is similar to the one we prove in
Section 6.5, but the calculation of $d$ is different. In our case of $f = h \circ c$, (5.2)
becomes

$$\min_d \; h\bigl(c(x + d)\bigr) + \frac{\mu}{2} |d|^2, \tag{5.3}$$

whose optimality conditions are, for some fixed current iterate $x$,

$$\nabla c(x + d)^* v + \mu d = 0, \quad \text{for some } v \in \partial h\bigl(c(x + d)\bigr). \tag{5.4}$$

Compare this system with the optimality conditions (4.2) from subproblem (4.1):

$$\nabla c(x)^* v + \mu d = 0, \quad \text{for some } v \in \partial h\bigl(c(x) + \nabla c(x) d\bigr).$$

In many applications of interest, $c$ is nonlinear, so the subproblem (5.3) is generally
harder to solve for the step $d$ than our subproblem (4.1).

Mifflin and Sagastizábal [25] describe an algorithm in which an approximate
solution of the subproblem (5.2) is obtained, again for the case of a convex objective,
by making use of a piecewise linear underapproximation to their objective $f$. The
approach is most suitable for a bundle method in which the piecewise-linear
approximation is constructed from subgradients gathered at previous iterations.
Approximations to the manifold of smoothness for $f$ are constructed from the solution of this
approximate proximal point calculation, and a Newton-like step for the Lagrangian
is taken along this manifold, as envisioned in earlier methods. Daniilidis, Hare, and
Malick [9] use the terminology "predictor-corrector" to describe algorithms of this
type. Their "predictor" step is the step along the manifold of smoothness for $f$, while
the "corrector" step (5.2) eventually returns the iterates to the correct active manifold
(see [9, Theorem 28]). Miller and Malick [26] show how algorithms of this type are
related to Newton-like methods that have been proposed earlier in various contexts.

Various of the algorithms discussed above make use of curvature information for 
the objective on the active manifold to accelerate local convergence. The algorithmic
framework that we describe in Section 7 could easily be modified to incorporate
similar techniques, while retaining its global convergence and manifold identification 
properties. 

6. Properties of the Proximal Linearized Subproblem. We show in this
section that when $h$ is prox-regular at $\bar{c}$, under a mild additional assumption, the
subproblem (4.1) has a local solution $d$ with norm $O(|x - \bar{x}|)$, when the parameter
$\mu$ is sufficiently large. When $h$ is convex, this solution is the unique global solution
of the subproblem. We show too that a point $x_{\mathrm{new}}$ near $x + d$ can be found such
that the objective value $h(c(x_{\mathrm{new}}))$ is close to the prediction of the linearized model
$h(c(x) + \nabla c(x) d)$. Further, we describe conditions under which the subproblem
correctly identifies the manifold $\mathcal{M}$ with respect to which $h$ is partly smooth at the
solution of (1.1).
6.1. Lipschitz Properties. We start with technical preliminaries. Allowing
nonlipschitz or extended-valued outer functions $h$ in our problem (1.1) is conceptually
appealing, since it allows us to model constraints that must be enforced. However,
this flexibility presents certain technical challenges, which we now address. We begin
with a simple example, to illustrate some of the difficulties.

Example 6.1. Define a function $c : \mathbb{R} \to \mathbb{R}^2$ by $c(x) = (x, x^2)$, and a lower semicontinuous convex function $h : \mathbb{R}^2 \to \bar{\mathbb{R}}$ by

\[ h(y,z) = \begin{cases} y & (z \ge 2y^2) \\ +\infty & (z < 2y^2). \end{cases} \]

The composite function $h \circ c$ is simply $\delta_{\{0\}}$, the indicator function of $\{0\}$. This function has a global minimum value zero, attained uniquely by $\bar x = 0$.

At any point $x \in \mathbb{R}$, the derivative map $\nabla c(x) : \mathbb{R} \to \mathbb{R}^2$ is given by $\nabla c(x)d = (d, 2xd)$ for $d \in \mathbb{R}$. Then, for all nonzero $x$, it is easy to check

\[ h(c(x) + \nabla c(x)d) = +\infty \quad \text{for all } d \in \mathbb{R}, \]

so the corresponding proximal linearized subproblem (4.1) has no feasible solutions: its objective value is identically $+\infty$.

The adjoint map $\nabla c(\bar x)^* : \mathbb{R}^2 \to \mathbb{R}$ is given by $\nabla c(\bar x)^* v = v_1$ for $v \in \mathbb{R}^2$, and

\[ \partial h(0,0) = \{ v \in \mathbb{R}^2 : v_1 = 1,\ v_2 \le 0 \}. \]

Hence the criticality condition (1.3) has no solution $v \in \mathbb{R}^2$.
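
The infeasibility claimed in Example 6.1 can be checked numerically. The following is a minimal sketch, assuming the reading $c(x) = (x, x^2)$ and $h(y,z) = y$ on the region $z \ge 2y^2$ (and $+\infty$ elsewhere); the function names and the sample grid are illustrative choices and do not come from the text. Algebraically, the linearized point $(x + d,\ x^2 + 2xd)$ satisfies $z - 2y^2 = -((x+d)^2 + d^2)$, which is negative unless $x = d = 0$.

```python
import math

def c(x):
    """Inner map c(x) = (x, x^2) from Example 6.1."""
    return (x, x * x)

def h(y, z):
    """Outer function: y on the region z >= 2*y^2, +infinity elsewhere."""
    return y if z >= 2.0 * y * y else math.inf

def linearized_value(x, d):
    """Evaluate h(c(x) + grad c(x) d), where grad c(x) d = (d, 2*x*d)."""
    y, z = c(x)
    return h(y + d, z + 2.0 * x * d)

def feasibility_gap(x, d):
    """z - 2*y^2 at the linearized point; algebraically -((x + d)^2 + d^2)."""
    y, z = c(x)
    return (z + 2.0 * x * d) - 2.0 * (y + d) ** 2

# Hypothetical sample grid of steps d in [-5, 5].
steps = [k / 10.0 for k in range(-50, 51)]
infeasible_everywhere = all(
    linearized_value(x, d) == math.inf
    for x in (1.0, -0.5, 1e-3)
    for d in steps
)
```

On this grid every linearized value is $+\infty$ for the nonzero sample points, while at $x = 0$, $d = 0$ the value is zero, in line with the example.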

This example illustrates two fundamental difficulties. The first is theoretical: the 
basic criticality condition (1.3) may be unsolvable, essentially because the chain rule
fails. The second is computational: if, implicit in the function h, are constraints on 
acceptable values for c{x), then curvature in these constraints can cause infeasibil- 
ity in linearizations. As we see below, resolving both difficulties requires a kind of 
"transversality" condition common in variational analysis. 

The transversality condition we need involves the "horizon subdifferential" of the function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ at the point $\bar c \in \mathbb{R}^m$, denoted $\partial^\infty h(\bar c)$. This object, which recurs throughout our analysis, consists of a set of "horizon subgradients", capturing information about directions in which $h$ grows faster than linearly near $\bar c$. Useful to keep in mind is the following fact:

\[ \partial^\infty h(\bar c) = \{0\} \quad \text{if $h$ is locally Lipschitz around $\bar c$.} \]

This condition holds in particular for a convex function $h$ that is continuous at $\bar c$. Readers interested only in continuous convex functions $h$ may therefore make the substantial simplification $\partial^\infty h(\bar c) = \{0\}$ throughout the analysis. For general convex $h : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$, for any point $\bar c$ in the domain $\mathrm{dom}\, h$ we have the following relationship between the horizon subdifferential and the classical normal cone to the domain (see [34, Proposition 8.12]):

\[ \partial^\infty h(\bar c) = N_{\mathrm{dom}\, h}(\bar c). \]

We seek conditions guaranteeing a reasonable step in the proximal linearized
subproblem (4.1). Our key tool is the following technical result.

Theorem 6.1. Consider a lower semicontinuous function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$, a point $\bar z \in \mathbb{R}^m$ where $h(\bar z)$ is finite, and a linear map $\bar G : \mathbb{R}^n \to \mathbb{R}^m$ satisfying

\[ \partial^\infty h(\bar z) \cap \mathrm{Null}(\bar G^*) = \{0\}. \]

Then there exists a constant $\gamma > 0$ such that, for all vectors $z \in \mathbb{R}^m$ and linear maps $G : \mathbb{R}^n \to \mathbb{R}^m$ with $(z, G)$ near $(\bar z, \bar G)$, there exists a vector $w \in \mathbb{R}^n$ satisfying

\[ |w| \le \gamma |z - \bar z| \quad \text{and} \quad h(z + Gw) \le h(\bar z) + \gamma |z - \bar z|. \]

Notice that this result is trivial if h is locally Lipschitz (or in particular continuous 
and convex) around z, since we can simply choose w = 0. The nonlipschitz case is 
harder; our proof appears below following the introduction of a variety of ideas from 
variational analysis whose use is confined to this subsection. We refer the reader to 
Rockafellar and Wets [34] or Mordukhovich [27] for further details. First, we need a 
"metric regularity" result. Since this theorem is a fundamental tool for us, we give 
two proofs, one of which specializes the proof of Theorem 3.3 in Dontchev, Lewis and 
Rockafellar [11], while the other sets the result into a broader context.

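Theorem 6.2 (uniform metric regularity under perturbation). Suppose that the closed set-valued mapping $F : \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ is metrically regular at a point $\bar u \in \mathbb{R}^p$ for a point $\bar v \in F(\bar u)$: in other words, there exist constants $\kappa, a > 0$ such that all points $u \in B_a(\bar u)$ and $v \in B_a(\bar v)$ satisfy

\[ \mathrm{dist}(u, F^{-1}(v)) \le \kappa \cdot \mathrm{dist}(v, F(u)). \qquad (6.1) \]

Then there exist constants $\delta, \gamma > 0$ such that all linear maps $H : \mathbb{R}^p \to \mathbb{R}^q$ with $\|H\| < \delta$ and all points $u \in B_\delta(\bar u)$ and $v \in B_\delta(\bar v)$ satisfy

\[ \mathrm{dist}(u, (F + H)^{-1}(v)) \le \gamma\, \mathrm{dist}(v, (F + H)(u)). \qquad (6.2) \]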

Proof. For our first approach, we follow the notation of the proof of [11, Theorem 3.3].
Fix any constants

Ag(0,k-^), aG (0,^(1 -KA)min{l,K}) , ,^ G (o, min { |, a}) . 

Then the proof shows inequality (6.2), if we define $\gamma = \kappa/(1 - \kappa\lambda)$.

As an alternative, more formal approach, denote the space of linear maps from $\mathbb{R}^p$ to $\mathbb{R}^q$ by $L(\mathbb{R}^p, \mathbb{R}^q)$, and define a mapping $g : L(\mathbb{R}^p, \mathbb{R}^q) \times \mathbb{R}^p \to \mathbb{R}^q$ and a parametric mapping $g_H : \mathbb{R}^p \to \mathbb{R}^q$ by $g(H, u) = g_H(u) = Hu$ for maps $H \in L(\mathbb{R}^p, \mathbb{R}^q)$ and points $u \in \mathbb{R}^p$. Using the notation of [10, Section 3], the Lipschitz constant $l[g](0; \bar u, 0)$ is by definition the infimum of the constants $\rho$ for which the inequality

\[ d(v, g_H(u)) \le \rho\, d(u, g_H^{-1}(v)) \]

holds for all triples $(u, v, H)$ sufficiently near the triple $(\bar u, 0, 0)$. A quick calculation shows that this constant is zero. We can also consider $F + g$ as a set-valued mapping from $L(\mathbb{R}^p, \mathbb{R}^q) \times \mathbb{R}^p$ to $\mathbb{R}^q$, defined by $(F + g)(H, u) = F(u) + Hu$, and then the parametric mapping $(F + g)_H : \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ is defined in the obvious way: in other words, $(F + g)_H(u) = F(u) + Hu$. According to [10, Theorem 2], we have the following relationship between the "covering rates" for $F$ and $F + g$:

\[ r[F + g](0; \bar u, \bar v) = r[F](\bar u, \bar v). \]

The reciprocal of the right-hand side is, by definition, the infimum of the constants $\kappa > 0$ such that inequality (6.1) holds for all pairs $(u, v)$ sufficiently near the pair $(\bar u, \bar v)$. By metric regularity, this number is strictly positive. On the other hand, the reciprocal of the left-hand side is, by definition, the infimum of the constants $\gamma > 0$ such that inequality (6.2) holds for all triples $(u, v, H)$ sufficiently near the triple $(\bar u, \bar v, 0)$.
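
To make the constant $\gamma = \kappa/(1 - \kappa\lambda)$ from the first proof concrete, here is a one-dimensional sketch. It assumes the single-valued linear map $F(u) = au$ with $a \ne 0$, for which $\kappa = 1/|a|$ works in (6.1), and a scalar perturbation $H$ with $|H| \le \lambda < |a|$. These choices, and the function names, are illustrative assumptions rather than anything taken from the text. In this case $(F+H)^{-1}(v) = \{v/(a+H)\}$, so both sides of (6.2) can be computed by hand.

```python
def perturbed_modulus(kappa, lam):
    """gamma = kappa / (1 - kappa*lam); requires 0 <= kappa*lam < 1."""
    if not 0.0 <= kappa * lam < 1.0:
        raise ValueError("need kappa * lam in [0, 1)")
    return kappa / (1.0 - kappa * lam)

def both_sides(a, H, u, v):
    """For F(u) = a*u and a scalar perturbation H, return the pair
    (dist(u, (F+H)^{-1}(v)), dist(v, (F+H)(u)))."""
    slope = a + H
    return abs(u - v / slope), abs(v - slope * u)

a = 2.0                              # hypothetical slope, so kappa = 1/|a|
kappa, lam = 1.0 / abs(a), 1.0       # perturbations with |H| <= lam < |a|
gamma = perturbed_modulus(kappa, lam)
```

In this linear setting the bound holds globally; the theorem itself only asserts it near the reference pair.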

The following result depends on an assumption about the normal cone to $S$ at a point $s \in S$, denoted $N_S(s)$, the basic building block for variational analysis (see Rockafellar and Wets [34] or Mordukhovich [27]). When $S$ is convex, it coincides exactly with the classic normal cone from convex analysis, while for smooth manifolds it coincides with the classical normal space.

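Corollary 6.3. Consider a closed set $S \subset \mathbb{R}^q$ with $0 \in S$, and a linear map $\bar A : \mathbb{R}^p \to \mathbb{R}^q$ satisfying

\[ N_S(0) \cap \mathrm{Null}(\bar A^*) = \{0\}. \]

Then there exists a constant $\gamma > 0$ such that, for all vectors $v \in \mathbb{R}^q$ and linear maps $A : \mathbb{R}^p \to \mathbb{R}^q$ with $(v, A)$ near $(0, \bar A)$, the inclusion

\[ v + Au \in S \]

has a solution $u \in \mathbb{R}^p$ satisfying $|u| \le \gamma |v|$.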

Proof. Corresponding to any linear map $A : \mathbb{R}^p \to \mathbb{R}^q$, define a set-valued mapping $F_A : \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ by $F_A(u) = Au - S$. A coderivative calculation shows, for vectors $v \in \mathbb{R}^q$,

\[ D^* F_{\bar A}(0|0)(v) = \begin{cases} \{\bar A^* v\} & (v \in N_S(0)) \\ \emptyset & (\text{otherwise}). \end{cases} \]

Hence, by assumption, the only vector $v \in \mathbb{R}^q$ satisfying $0 \in D^* F_{\bar A}(0|0)(v)$ is zero, so by [34, Thm 9.43], the mapping $F_{\bar A}$ is metrically regular at zero for zero. Applying the preceding theorem shows that there exist constants $\delta, \gamma > 0$ such that, if $\|A - \bar A\| < \delta$ and $|v| < \delta$, then we have

\[ \mathrm{dist}(0, F_A^{-1}(-v)) \le \gamma\, \mathrm{dist}(-v, F_A(0)), \]

or equivalently,

\[ \mathrm{dist}(0, A^{-1}(S - v)) \le \gamma\, \mathrm{dist}(v, S). \]

Since $0 \in S$, the right-hand side is bounded above by $\gamma |v|$, so the result follows. □
We are now ready to prove the result we claimed at the outset of this subsection.

Proof of Theorem 6.1. Without loss of generality, we can suppose $\bar z = 0$ and $h(0) = 0$. Let $S \subset \mathbb{R}^m \times \mathbb{R}$ be the epigraph of $h$, and define a map $\bar A : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^m \times \mathbb{R}$ by $\bar A(w, \tau) = (\bar G w, \tau)$. Clearly we have $\mathrm{Null}(\bar A^*) = \mathrm{Null}(\bar G^*) \times \{0\}$, so [34, Theorem 8.9] shows

\[ N_S(0,0) \cap \mathrm{Null}(\bar A^*) = \{(0,0)\}. \]

For any vector $z$ and linear map $G$ with $(z, G)$ near $(\bar z, \bar G)$, the vector $(z, 0) \in \mathbb{R}^m \times \mathbb{R}$ is near the vector $(\bar z, 0)$ and the map $(w, \tau) \mapsto (Gw, \tau)$ is near the map $(w, \tau) \mapsto (\bar G w, \tau)$. The previous corollary shows the existence of a constant $\gamma > 0$ such that, for all such $z$ and $G$, the inclusion

\[ (z, 0) + (Gw, \tau) \in S \]

has a solution satisfying $|(w, \tau)| \le \gamma |(z, 0)|$, and the result follows. □

We end this subsection with another tool to be used later: the proof is a straightforward application of standard ideas from variational analysis. Like Theorem 6.2, this tool concerns metric regularity, this time for a constraint system of the form $F(z) \in S$ for an unknown vector $z$, where the map $F$ is smooth, and $S$ is a closed set.

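Theorem 6.4 (metric regularity of constraint systems). Consider a smooth map $F : \mathbb{R}^p \to \mathbb{R}^q$, a point $\bar z \in \mathbb{R}^p$, and a closed set $S \subset \mathbb{R}^q$ containing the vector $F(\bar z)$. Suppose the condition

\[ N_S(F(\bar z)) \cap \mathrm{Null}(\nabla F(\bar z)^*) = \{0\} \]

holds. Then there exists a constant $\kappa > 0$ such that all points $z \in \mathbb{R}^p$ near $\bar z$ satisfy the inequality

\[ \mathrm{dist}(z, F^{-1}(S)) \le \kappa \cdot \mathrm{dist}(F(z), S). \]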

Proof. We simply need to check that the set-valued mapping $G : \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ defined by $G(z) = F(z) - S$ is metrically regular at $\bar z$ for zero. Much the same coderivative calculation as in the proof of Corollary 6.3 shows, for vectors $v \in \mathbb{R}^q$, the formula

\[ D^* G(\bar z|0)(v) = \begin{cases} \{\nabla F(\bar z)^* v\} & (v \in N_S(F(\bar z))) \\ \emptyset & (\text{otherwise}). \end{cases} \]

Hence, by assumption, the only vector $v \in \mathbb{R}^q$ satisfying $0 \in D^* G(\bar z|0)(v)$ is zero, so metric regularity follows by [34, Thm 9.43]. □

6.2. The Proximal Step. We now prove a key result. Under a standard transversality condition, and assuming the proximal parameter $\mu$ is sufficiently large (if the function $h$ is nonconvex), we show the existence of a step $d = O(|x - \bar x|)$ in the proximal linearized subproblem (4.1) with corresponding objective value close to the critical value $h(\bar c)$.

When the outer function $h$ is locally Lipschitz (or, in particular, continuous and convex), this result and its proof simplify considerably. First, the transversality condition is automatic. Second, while the proof of the result appeals to the technical tool we developed in the previous subsection (Theorem 6.1), this tool is trivial in the Lipschitz case, as we noted earlier.

Theorem 6.5 (proximal step). Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map $c : \mathbb{R}^n \to \mathbb{R}^m$. Suppose that $c$ is $C^2$ around the point $\bar x \in \mathbb{R}^n$, that $h$ is prox-regular at the point $\bar c = c(\bar x)$, and that the composite function $h \circ c$ is critical at $\bar x$. Assume the transversality condition

\[ \partial^\infty h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*) = \{0\}. \qquad (6.3) \]

Then there exist numbers $\bar\mu \ge 0$, $\delta > 0$, and $\rho > 0$, and a mapping $d : B_\delta(\bar x) \times (\bar\mu, \infty) \to \mathbb{R}^n$ such that the following properties hold.

(a) For all points $x \in B_\delta(\bar x)$ and all parameter values $\mu > \bar\mu$, the step $d(x, \mu)$ is a local minimizer of the proximal linearized subproblem (4.1), and moreover $|d(x, \mu)| \le \rho |x - \bar x|$.

(b) Given any sequences $x_r \to \bar x$ and $\mu_r > \bar\mu$, then if either $\mu_r |x_r - \bar x|^2 \to 0$ or $h(c(x_r)) \to h(\bar c)$, we have

\[ h\big(c(x_r) + \nabla c(x_r) d(x_r, \mu_r)\big) \to h(\bar c). \qquad (6.4) \]

(c) When $h$ is convex and lower semicontinuous, the results of parts (a) and (b) hold with $\bar\mu = 0$.

Proof. Without loss of generality, suppose $\bar x = 0$ and $\bar c = c(0) = 0$, and furthermore $h(0) = 0$. By assumption,

\[ 0 \in \partial (h \circ c)(0) \subset \nabla c(0)^* \partial h(0), \]

using the chain rule [34, Thm 10.6], so there exists a vector

\[ v \in \partial h(0) \cap \mathrm{Null}(\nabla c(0)^*). \]

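We first prove part (a). By prox-regularity, there exists a constant $\rho > 0$ such that

\[ h(z) \ge \langle v, z \rangle - \frac{\rho}{2} |z|^2 \qquad (6.5) \]

for all small vectors $z \in \mathbb{R}^m$. Hence, there exists a constant $\delta_1 > 0$ such that $\nabla c$ is continuous on $B_{\delta_1}(0)$ and

\[ h_{x,\mu}(d) \ge \langle v, c(x) + \nabla c(x) d \rangle - \frac{\rho}{2} |c(x) + \nabla c(x) d|^2 + \frac{\mu}{2} |d|^2 \]

for all vectors $x, d \in B_{\delta_1}(0)$. As a consequence, we have that

\[ h_{x,\mu}(d) \ge \min_{|x| \le \delta_1,\ |d| = \delta_1} \Big\{ \langle v, c(x) + \nabla c(x) d \rangle - \frac{\rho}{2} |c(x) + \nabla c(x) d|^2 \Big\} + \frac{\mu}{2} |d|^2, \]

and the term in braces is finite by continuity of $c$ and $\nabla c$ on $B_{\delta_1}(0)$. Hence by choosing $\bar\mu$ sufficiently large (certainly greater than $\rho \|\nabla c(0)\|^2$) we can ensure that

\[ h_{x,\bar\mu}(d) \ge 1 \quad \text{whenever } |x| \le \delta_1,\ |d| = \delta_1. \]

Then for $x \in B_{\delta_1}(0)$, $|d| = \delta_1$, and $\mu \ge \bar\mu$, we have

\[ h_{x,\mu}(d) = h_{x,\bar\mu}(d) + \frac{1}{2}(\mu - \bar\mu)|d|^2 \ge 1 + \frac{1}{2}(\mu - \bar\mu)\delta_1^2. \qquad (6.6) \]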

Since $c$ is $C^2$ at 0, there exist constants $\beta > 0$ and $\delta_2 \in (0, \delta_1)$ such that, for all $x \in B_{\delta_2}(0)$, the vector

\[ z(x) = c(x) - \nabla c(x) x \qquad (6.7) \]

satisfies $|z(x)| \le \beta |x|^2$. Setting $G = \nabla c(x)$, $\bar G = \nabla c(0)$, $\bar z = 0$, and $z = z(x)$, we now apply Theorem 6.1. Hence for some constants $\gamma > 0$ and $\delta_3 \in (0, \delta_2)$, given any vector $x \in B_{\delta_3}(0)$, there exists a vector $d \in \mathbb{R}^n$ (defined by $d = w - x$, in the notation of the theorem) satisfying

\[ |x + d| \le \gamma |z(x)| \le \gamma\beta |x|^2, \]
\[ h(c(x) + \nabla c(x) d) \le \gamma |z(x)| \le \gamma\beta |x|^2. \]

We deduce the existence of a constant $\delta_4 \in (0, \delta_3)$ such that, for all $x \in B_{\delta_4}(0)$, the corresponding $d$ satisfies

\[ |d| \le |x| + \gamma\beta |x|^2 < \delta_1, \]

and

\[ h_{x,\mu}(d) = h(c(x) + \nabla c(x) d) + \frac{\mu}{2} |d|^2 \le \gamma\beta |x|^2 + \frac{\mu}{2} \big(|x| + \gamma\beta |x|^2\big)^2. \]

We denote this $d$ by $\hat d(x)$.

The lower semicontinuous function $h_{x,\mu}$ must have a minimizer (which we denote $d(x, \mu)$) over the compact set $B_{\delta_1}(0)$, and the inequality above implies the corresponding minimum value is majorized by $h_{x,\mu}(\hat d(x))$, and thus is strictly less than $1 + (1/2)(\mu - \bar\mu)\delta_1^2$. But inequality (6.6) implies that this minimizer must lie in the interior of the ball $B_{\delta_1}(0)$; in particular, it must be an unconstrained local minimizer of $h_{x,\mu}$. By setting $\delta = \delta_4$, we complete the proof of the first part of (a). Notice furthermore that for $x \in B_{\delta_4}(0)$, we have

\[ h(c(x) + \nabla c(x) d(x, \mu)) \le h_{x,\mu}(d(x, \mu)) \le h_{x,\mu}(\hat d(x)) \le \gamma\beta |x|^2 + \frac{\mu}{2} \big(|x| + \gamma\beta |x|^2\big)^2. \qquad (6.8) \]

We now prove the remainder of part (a), that is, uniform boundedness of the ratio $|d(x, \mu)|/|x|$. Suppose there are sequences $x_r \in B_\delta(\bar x)$ and $\mu_r \ge \bar\mu$ such that $|d(x_r, \mu_r)|/|x_r| \to \infty$. Since $|d(x_r, \mu_r)| \le \delta_1$ by the arguments above, we must have $x_r \to 0$. By the arguments above, writing $d_r = d(x_r, \mu_r)$, for all large $r$ we have the following inequalities:

\[ \gamma\beta |x_r|^2 + \frac{\mu_r}{2} \big(|x_r| + \gamma\beta |x_r|^2\big)^2 \ge \langle v, c(x_r) + \nabla c(x_r) d_r \rangle - \frac{\rho}{2} |c(x_r) + \nabla c(x_r) d_r|^2 + \frac{\mu_r}{2} |d_r|^2. \]

Dividing each side by $(1/2)\mu_r |x_r|^2$ and letting $r \to \infty$, we recall

\[ \mu_r \ge \bar\mu > \rho \|\nabla c(0)\|^2 \ge 0 \]

and observe that the left-hand side remains finite, while the right-hand side is eventually dominated by $(1 - \rho \|\nabla c(0)\|^2/\mu_r)\,|d_r|^2/|x_r|^2$, which approaches $\infty$, yielding a contradiction.

For part (b), suppose first that $\mu_r |x_r|^2 \to 0$. By substituting $(x, \mu) = (x_r, \mu_r)$ into (6.8), we have that

\[ \limsup\, h\big(c(x_r) + \nabla c(x_r) d(x_r, \mu_r)\big) \le 0. \qquad (6.9) \]

From part (a), we have that $|d(x_r, \mu_r)|/|x_r|$ is uniformly bounded, hence $d(x_r, \mu_r) \to 0$ and thus $c(x_r) + \nabla c(x_r) d(x_r, \mu_r) \to 0$. Being prox-regular, $h$ is lower semicontinuous at 0, so

\[ \liminf\, h\big(c(x_r) + \nabla c(x_r) d(x_r, \mu_r)\big) \ge 0. \]

By combining these last two inequalities, we obtain

\[ h\big(c(x_r) + \nabla c(x_r) d(x_r, \mu_r)\big) \to 0, \]

as required.

Now suppose instead that $h(c(x_r)) \to h(\bar c) = 0$. We have from (6.8) that

\[ h\big(c(x_r) + \nabla c(x_r) d(x_r, \mu_r)\big) \le h_{x_r,\mu_r}(d(x_r, \mu_r)) \le h_{x_r,\mu_r}(0) = h(c(x_r)). \]

By taking the lim sup of both sides we again obtain (6.9), and the result follows as before.

For part (c), when $h$ is lower semicontinuous and convex, the argument simplifies. We set $\rho = 0$ in (6.5) and choose the constant $\delta > 0$ so the map $\nabla c$ is continuous on $B_\delta(0)$. Choosing the constants $\beta$ and $\gamma$ as before, Theorem 6.1 again guarantees the existence, for all small points $x$, of a step $\hat d(x)$ satisfying

\[ h(c(x) + \nabla c(x) \hat d(x)) \le \gamma\beta |x|^2. \]

We now deduce that the proximal linearized objective $h_{x,\mu}$ is somewhere finite, so has compact level sets, by coercivity. Thus it has a global minimizer $d(x, \mu)$ (unique, by strict convexity), which must satisfy the inequality

\[ h(c(x) + \nabla c(x) d(x, \mu)) \le h(c(x) + \nabla c(x) \hat d(x)) \le \gamma\beta |x|^2. \]

The remainder of the argument proceeds as before. □
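
For a concrete feel for the proximal step in the convex case, the following sketch solves the scalar proximal linearized subproblem in closed form. It assumes $h = |\cdot|$ and $c = \sin$, with critical point $\bar x = 0$; these are illustrative choices not taken from the text, and the function names are hypothetical. With $a = c(x)$ and $g = c'(x)$, the minimizer of $|a + gd| + (\mu/2)d^2$ is $d = -a/g$ when $\mu |a| \le g^2$ (the linearization lands exactly on the kink of $h$), and $d = -g\,\mathrm{sign}(a)/\mu$ otherwise.

```python
import math

def prox_linear_step(a, g, mu):
    """Minimizer over scalar d of |a + g*d| + (mu/2)*d**2, for mu > 0, g != 0."""
    if mu * abs(a) <= g * g:
        return -a / g                        # linearization lands on the kink
    return -g * math.copysign(1.0, a) / mu   # kink out of reach: damped step

def model(a, g, mu, d):
    """Proximal linearized objective for h = |.| in one dimension."""
    return abs(a + g * d) + 0.5 * mu * d * d

def step_at(x, mu):
    """Step for the hypothetical choices h = |.| and c = sin."""
    return prox_linear_step(math.sin(x), math.cos(x), mu)
```

For small $x$ and moderate $\mu$ the step is $d = -\tan x$, so $|d| = O(|x - \bar x|)$ and the linearized point $c(x) + c'(x)d$ is exactly zero, the point where $h$ is nonsmooth. For very large $\mu$ relative to $1/|x|$ the step is damped and the kink is not reached.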

We discuss Theorem 6.5(b) by giving a simple example of a function prox-regular at $c(\bar x)$ such that for sequences $x_r \to \bar x$ and $\mu_r \to \infty$ that satisfy neither $\mu_r |x_r - \bar x|^2 \to 0$ nor $h(c(x_r)) \to h(c(\bar x))$, the conclusion (6.4) fails to hold. For a scalar $x$, take $c(x) = x$ and

\[ h(c) = \begin{cases} -c & (c \le 0) \\ 1 + c & (c > 0). \end{cases} \]

The unique critical point is clearly $\bar x = 0$ with $c(\bar x) = 0$ and $h(c(\bar x)) = 0$, and this problem satisfies the assumptions of the theorem. Consider $x > 0$, for which the subproblem (4.1) is

\[ \min_d\ h_{x,\mu}(d) = h(x + d) + \frac{\mu}{2} d^2 = \begin{cases} -x - d + \frac{\mu}{2} d^2 & (x + d \le 0) \\ 1 + x + d + \frac{\mu}{2} d^2 & (x + d > 0). \end{cases} \]

When $\mu_r x_r \in (0, 1]$, then $d_r = -x_r$ is the only local minimizer of $h_{x_r,\mu_r}$. When $\mu_r x_r > 1$, the situation is more interesting. The value $d_r = -\mu_r^{-1}$ minimizes the "positive" branch of $h_{x_r,\mu_r}$, with function value $1 + x_r - (2\mu_r)^{-1}$, and there is a second local minimizer at $d_r = -x_r$, with function value $(\mu_r/2) x_r^2$. (In both cases, the local minimizers satisfy the estimate $|d_r| = O(|x_r - \bar x|)$ proved in part (a).) Comparison of the function values shows that in fact the global minimum is achieved at the former point, $d_r = -\mu_r^{-1}$, when $x_r > \mu_r^{-1} + \sqrt{2/\mu_r}$. If this step is taken, we have $x_r + d_r > 0$, so the new iterate remains on the upper branch of $h$. For sequences $x_r > \mu_r^{-1} + \sqrt{2/\mu_r}$ and $\mu_r \to \infty$, we thus have for the global minimizer $d_r = -\mu_r^{-1}$ of $h_{x_r,\mu_r}$ that $h(c(x_r) + \nabla c(x_r) d_r) > 1$ for all $r$, while $h(c(\bar x)) = 0$, so that (6.4) does not hold.
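
The two local minimizers in this example and the threshold $x > \mu^{-1} + \sqrt{2/\mu}$ can be verified directly. The sketch below assumes the piecewise function $h(c) = -c$ for $c \le 0$ and $h(c) = 1 + c$ for $c > 0$, with $c(x) = x$; the helper names are hypothetical and the sample values of $x$ and $\mu$ are arbitrary.

```python
import math

def h(c):
    """Outer function of the example: -c for c <= 0, 1 + c for c > 0."""
    return -c if c <= 0 else 1.0 + c

def subproblem(x, mu, d):
    """Proximal linearized objective h(x + d) + (mu/2) d^2 for c(x) = x."""
    return h(x + d) + 0.5 * mu * d * d

def local_minimizers(x, mu):
    """Local minimizers (d, value) of the subproblem for x > 0, mu > 0."""
    mins = [(-x, 0.5 * mu * x * x)]            # step back to the kink at 0
    if mu * x > 1.0:                           # interior minimizer of upper branch
        mins.append((-1.0 / mu, 1.0 + x - 0.5 / mu))
    return mins

def global_step(x, mu):
    """Step attaining the smallest subproblem value among the local minimizers."""
    return min(local_minimizers(x, mu), key=lambda pair: pair[1])[0]

def threshold(mu):
    """Value of x above which the upper-branch minimizer is the global one."""
    return 1.0 / mu + math.sqrt(2.0 / mu)
```

For instance, with $\mu = 100$ the threshold is roughly $0.151$: starting from $x = 0.2$ the global step is $d = -0.01$ and the linearized value stays above 1, whereas from $x = 0.1$ the global step is $d = -x$ and the linearized value is 0.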

6.3. Restoring Feasibility. In the algorithmic framework that we have in mind, the basic iteration starts at a current point $x \in \mathbb{R}^n$ such that the function $h$ is finite at the vector $c(x)$. We then solve the proximal linearized subproblem (4.1) to obtain the step $d = d(x, \mu) \in \mathbb{R}^n$ discussed earlier in this section. Under reasonable conditions we showed that, for $x$ near the critical point $\bar x$, we have $d = O(|x - \bar x|)$ and furthermore we know that the value of $h$ at the vector $c(x) + \nabla c(x) d$ is close to the critical value $h(c(\bar x))$.

The algorithmic idea is now to update the point $x$ to a new point $x + d$. When the function $h$ is Lipschitz, this update is motivated by the fact that, since the map $c$ is $C^2$, we have, uniformly for $x$ near the critical point $\bar x$,

\[ c(x + d) - (c(x) + \nabla c(x) d) = O(|d|^2) \]

and hence

\[ h(c(x + d)) - h(c(x) + \nabla c(x) d) = O(|d|^2). \]

However, if $h$ is not Lipschitz, it may not be appropriate to update $x$ to $x + d$: the value $h(c(x + d))$ may even be infinite.

In order to take another basic iteration, we need somehow to restore the point $x + d$ to feasibility, or more generally to find a nearby point with objective value not much worse than our linearized estimate $h(c(x) + \nabla c(x) d)$. Depending on the form of the function $h$, this may or may not be easy computationally. However, as we now discuss, our fundamental transversality condition (6.3) guarantees that such a restoration is always possible in theory. In the next section, we refer to this restoration process as an "efficient projection."

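Theorem 6.6 (linear estimator improvement). Consider a map $c : \mathbb{R}^n \to \mathbb{R}^m$ that is $C^2$ around the point $\bar x \in \mathbb{R}^n$, and a lower semicontinuous function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ that is finite at the vector $\bar c = c(\bar x)$. Assume the transversality condition (6.3) holds, namely

\[ \partial^\infty h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*) = \{0\}. \]

Then there exist constants $\gamma, \delta > 0$ such that, for any point $x \in B_\delta(\bar x)$ and any step $d \in B_\delta(0) \subset \mathbb{R}^n$ for which $|h(c(x) + \nabla c(x) d) - h(\bar c)| < \delta$, there exists a point $x_{\rm new} \in \mathbb{R}^n$ satisfying

\[ |x_{\rm new} - (x + d)| \le \gamma |d|^2 \quad \text{and} \quad h(c(x_{\rm new})) \le h(c(x) + \nabla c(x) d) + \gamma |d|^2. \qquad (6.10) \]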

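Proof. Define a map $F : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^m \times \mathbb{R}$ by $F(x, t) = (c(x), t)$. Notice that the epigraph $\mathrm{epi}\, h$ is a closed set containing the vector $F(\bar x, h(\bar c))$. Clearly we have

\[ \mathrm{Null}\big(\nabla F(\bar x, h(\bar c))^*\big) = \mathrm{Null}(\nabla c(\bar x)^*) \times \{0\}. \]

On the other hand, using Rockafellar and Wets [34, Theorem 8.9], we have

\[ (y, 0) \in N_{\mathrm{epi}\, h}(\bar c, h(\bar c)) \iff y \in \partial^\infty h(\bar c). \]

Hence the transversality condition is equivalent to

\[ N_{\mathrm{epi}\, h}(\bar c, h(\bar c)) \cap \mathrm{Null}\big(\nabla F(\bar x, h(\bar c))^*\big) = \{0\}. \]

We next apply Theorem 6.4 to deduce the existence of a constant $\kappa > 0$ such that, for all vectors $(u, t)$ near the vector $(\bar x, h(\bar c))$ we have

\[ \mathrm{dist}\big((u, t), F^{-1}(\mathrm{epi}\, h)\big) \le \kappa \cdot \mathrm{dist}\big(F(u, t), \mathrm{epi}\, h\big). \]

Thus there exists a constant $\delta > 0$ such that, for any point $x \in B_\delta(\bar x)$ and any step $d \in \mathbb{R}^n$ satisfying $|d| \le \delta$ and $|h(c(x) + \nabla c(x) d) - h(\bar c)| < \delta$, we have

\[ \mathrm{dist}\Big( \big(x + d,\ h(c(x) + \nabla c(x) d)\big),\ F^{-1}(\mathrm{epi}\, h) \Big) \]
\[ \le \kappa \cdot \mathrm{dist}\Big( F\big(x + d,\ h(c(x) + \nabla c(x) d)\big),\ \mathrm{epi}\, h \Big) \]
\[ = \kappa \cdot \mathrm{dist}\Big( \big(c(x + d),\ h(c(x) + \nabla c(x) d)\big),\ \mathrm{epi}\, h \Big) \]
\[ \le \kappa \cdot \big| c(x + d) - (c(x) + \nabla c(x) d) \big|, \]

since

\[ \big(c(x) + \nabla c(x) d,\ h(c(x) + \nabla c(x) d)\big) \in \mathrm{epi}\, h. \]

Since the map $c$ is $C^2$, by reducing $\delta$ if necessary we can ensure the existence of a constant $\gamma > 0$ such that the right-hand side of the above chain of inequalities is bounded above by $\gamma |d|^2$.

We have therefore shown the existence of a vector

\[ (x_{\rm new}, t) \in F^{-1}(\mathrm{epi}\, h) \]

satisfying the inequalities

\[ |x_{\rm new} - (x + d)| \le \gamma |d|^2 \quad \text{and} \quad |t - h(c(x) + \nabla c(x) d)| \le \gamma |d|^2. \]

We therefore know $t \ge h(c(x_{\rm new}))$, so the result follows. □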
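
A small example may help to visualize the restoration step. The sketch below assumes $c(x) = |x|^2 - 1$ on $\mathbb{R}^2$ and takes $h$ to be the indicator function of $\{0\}$, so that $h \circ c$ is the indicator of the unit circle; this setting, the function names, and the sample steps are illustrative assumptions and are not taken from the text. From a point $x$ on the circle, a tangent step $d$ keeps the linearization feasible ($c(x) + \nabla c(x)d = 0$), but $x + d$ leaves the circle. Radial projection restores feasibility at a distance $\sqrt{1 + |d|^2} - 1 \le |d|^2/2$ from $x + d$, which is the kind of estimate in (6.10), here with $\gamma = 1/2$.

```python
import math

def c(x):
    """Hypothetical inner map: c(x) = |x|^2 - 1 on R^2."""
    return x[0] ** 2 + x[1] ** 2 - 1.0

def tangent_step(x, t):
    """Step d of length |t| with grad c(x) . d = 0 (tangent to the level set)."""
    n = math.hypot(x[0], x[1])
    return (-t * x[1] / n, t * x[0] / n)

def restore(x_trial):
    """Radial projection onto the unit circle {x : c(x) = 0}."""
    n = math.hypot(x_trial[0], x_trial[1])
    return (x_trial[0] / n, x_trial[1] / n)

def restoration_gap(x, t):
    """Distance |x_new - (x + d)| for the tangent step d of length |t|."""
    d = tangent_step(x, t)
    trial = (x[0] + d[0], x[1] + d[1])
    x_new = restore(trial)
    return math.hypot(x_new[0] - trial[0], x_new[1] - trial[1])
```

In this setting the projection is explicit; for a general outer function $h$ the theorem only guarantees that a suitable $x_{\rm new}$ exists.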

6.4. Uniqueness of the Proximal Step and Convergence of Multipliers. Our focus in this subsection is on uniqueness of the local solution of (4.1) near 0, uniqueness of the corresponding multiplier vector, and on showing that the solution $d(x, \mu)$ of (4.1) has a strictly lower subproblem objective value than $d = 0$. For the uniqueness results, we strengthen the transversality condition (6.3) to a constraint qualification that we now introduce.

Throughout this subsection we assume that the function $h$ is prox-regular at the point $\bar c$. Since prox-regular functions are (Clarke) subdifferentially regular, the subdifferential $\partial h(\bar c)$ is a closed and convex set in $\mathbb{R}^m$, and its recession cone is exactly the horizon subdifferential $\partial^\infty h(\bar c)$ (see [34, Corollary 8.11]). Denoting the subspace parallel to the affine span of the subdifferential by $\mathrm{par}\, \partial h(\bar c)$, we deduce that

\[ \partial^\infty h(\bar c) \subset \mathrm{par}\, \partial h(\bar c). \]

Hence the "constraint qualification" that we next consider, namely

\[ \mathrm{par}\, \partial h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*) = \{0\}, \qquad (6.11) \]

implies the transversality condition (6.3).

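Condition (6.11) is related to the linear independence constraint qualification in classical nonlinear programming. To illustrate, consider again the case of Section 3, where the function $h$ is finite and polyhedral:

\[ h(c) = \max_{i \in I} \{ \langle h_i, c \rangle + \beta_i \} \]

for some given vectors $h_i \in \mathbb{R}^m$ and scalars $\beta_i$. Then, as we noted,

\[ \partial h(\bar c) = \mathrm{conv}\{ h_i : i \in \bar I \}, \]

where $\bar I$ is the set of active indices, so

\[ \mathrm{par}\, \partial h(\bar c) = \Big\{ \sum_{i \in \bar I} \lambda_i h_i : \sum_{i \in \bar I} \lambda_i = 0 \Big\}. \]

Thus condition (6.11) states

\[ \sum_{i \in \bar I} \lambda_i \begin{bmatrix} \nabla c(\bar x)^* h_i \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \ \Longrightarrow\ \sum_{i \in \bar I} \lambda_i h_i = 0. \qquad (6.12) \]

By contrast, the linear independence constraint qualification for the corresponding nonlinear program (3.3) at the point $(\bar x, -h(\bar c))$ is

\[ \sum_{i \in \bar I} \lambda_i \begin{bmatrix} \nabla c(\bar x)^* h_i \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \iff \lambda_i = 0 \ (i \in \bar I), \]

which is a stronger assumption than condition (6.12).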
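
The gap between the two conditions can be checked mechanically in small cases. The sketch below reads condition (6.12) as follows: every $\lambda$ with $\sum_i \lambda_i = 0$ and $\sum_i \lambda_i \nabla c(\bar x)^* h_i = 0$ must also satisfy $\sum_i \lambda_i h_i = 0$. Equivalently, appending the rows of the matrix with columns $h_i$ does not increase the rank of the matrix with columns $(\nabla c(\bar x)^* h_i, 1)$. Linear independence instead requires that matrix to have full column rank. The function names, the representation of $\nabla c(\bar x)$ as a list of rows `jac`, and the exact-arithmetic rank routine are all illustrative assumptions.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def stacked_rows(jac, active_h):
    """Rows of the matrix whose i-th column is (jac^T h_i, 1).

    jac is the m-by-n Jacobian of c at the critical point (list of m rows);
    active_h lists the active vectors h_i, each of length m.
    """
    n = len(jac[0])
    rows = [
        [sum(jac[k][j] * h[k] for k in range(len(h))) for h in active_h]
        for j in range(n)
    ]
    rows.append([1] * len(active_h))
    return rows

def licq_holds(jac, active_h):
    """Linear independence of the columns (jac^T h_i, 1)."""
    return rank(stacked_rows(jac, active_h)) == len(active_h)

def cq_6_12_holds(jac, active_h):
    """Condition (6.12): null space of the stacked matrix lies in that of [h_i]."""
    a = stacked_rows(jac, active_h)
    h_rows = [[h[k] for h in active_h] for k in range(len(active_h[0]))]
    return rank(a + h_rows) == rank(a)
```

For example, with $c(x) = x$ on $\mathbb{R}$ and active vectors $h_1 = h_2 = 1$, $h_3 = -1$ (a repeated piece in the max), linear independence fails while the weaker condition holds.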

We now prove a straightforward technical result that addresses two issues: existence and boundedness of multipliers for the proximal subproblem (4.1), and the convergence of these multipliers to a unique multiplier that satisfies criticality conditions for (1.1), when the constraint qualification (6.11) is satisfied. The argument is routine but, as usual, it simplifies considerably in the case of $h$ locally Lipschitz (or in particular convex and continuous) around the point $\bar c$, since then the horizon subdifferential $\partial^\infty h$ is identically $\{0\}$ near $\bar c$.

Lemma 6.7. Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map $c : \mathbb{R}^n \to \mathbb{R}^m$. Suppose that $c$ is $C^2$ around the point $\bar x \in \mathbb{R}^n$, that $h$ is prox-regular at the point $\bar c = c(\bar x)$, and that the composite function $h \circ c$ is critical at $\bar x$.

When the transversality condition (6.3) holds, then for any sequences $\mu_r > 0$ and $x_r \to \bar x$ such that $\mu_r |x_r - \bar x| \to 0$, and any sequence of critical points $d_r \in \mathbb{R}^n$ for the corresponding proximal linearized subproblems (4.1) satisfying the conditions

\[ d_r = O(|x_r - \bar x|) \quad \text{and} \quad h(c(x_r) + \nabla c(x_r) d_r) \to h(\bar c), \]

there exists a bounded sequence of vectors $v_r \in \mathbb{R}^m$ that satisfy

\[ 0 = \nabla c(x_r)^* v_r + \mu_r d_r, \qquad (6.13a) \]
\[ v_r \in \partial h(c(x_r) + \nabla c(x_r) d_r). \qquad (6.13b) \]

When the stronger constraint qualification (6.11) holds, in place of (6.3), the set of multipliers $v \in \mathbb{R}^m$ solving the criticality condition (1.3), namely

\[ \partial h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*), \qquad (6.14) \]

is in fact a singleton $\{\bar v\}$. Furthermore, any sequence of multipliers $\{v_r\}$ satisfying the conditions above converges to $\bar v$.

Proof. We first assume (6.3), and claim that

\[ \partial^\infty h(c(x_r) + \nabla c(x_r) d_r) \cap \mathrm{Null}(\nabla c(x_r)^*) = \{0\} \qquad (6.15) \]

for all large $r$. Indeed, if this property should fail, then for infinitely many $r$ there would exist a unit vector $v_r$ lying in the intersection on the left-hand side, and any limit point of these unit vectors must lie in the set

\[ \partial^\infty h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*), \qquad (6.16) \]

by outer semicontinuity of the set-valued mapping $\partial^\infty h$ at the point $\bar c$ [34, Proposition 8.7], contradicting the transversality condition (6.3). As a consequence, we can apply the chain rule to deduce the existence of vectors $v_r \in \mathbb{R}^m$ satisfying (6.13). This sequence must be bounded, since otherwise, after taking a subsequence, we could suppose $|v_r| \to \infty$ and then any limit point of the unit vectors $|v_r|^{-1} v_r$ would lie in the set (6.16), again contradicting the transversality condition. The first claim of the lemma is proved.

For the second claim, we assume the constraint qualification (6.11) and note as above that it implies the transversality condition (6.3), so the chain rule implies that the set (6.14) is nonempty. This set must therefore be a singleton $\{\bar v\}$, using (6.11) again. Using boundedness of $\{v_r\}$, and the fact that $\mu_r d_r \to 0$, we have by taking limits in (6.13) that any limit point of $\{v_r\}$ lies in (6.14) (by outer semicontinuity of $\partial h$ at $\bar c$), and therefore $v_r \to \bar v$. □

Using Theorem 6.5, we show that the local minimizers of $h_{x_r,\mu_r}$ satisfy the desired properties, and in addition give a strict improvement over 0 in the subproblem (4.1).

Lemma 6.8. Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map $c : \mathbb{R}^n \to \mathbb{R}^m$. Suppose that $c$ is $C^2$ around the point $\bar x \in \mathbb{R}^n$, that $h$ is prox-regular at the point $\bar c = c(\bar x)$, that the composite function $h \circ c$ is critical at $\bar x$, and that the transversality condition (6.3) holds. Defining $\bar\mu$ as in Theorem 6.5, let $\mu_r > \bar\mu$ and $x_r \to \bar x$ be sequences such that $\mu_r |x_r - \bar x| \to 0$. Then for all $r$ sufficiently large, there is a local minimizer $d_r$ of $h_{x_r,\mu_r}$ such that

\[ d_r = O(|x_r - \bar x|) \quad \text{and} \quad h(c(x_r) + \nabla c(x_r) d_r) \to h(\bar c). \qquad (6.17) \]

Moreover, if $0 \notin \partial (h \circ c)(x_r)$ for all $r$, then $d_r \ne 0$ and

\[ h_{x_r,\mu_r}(d_r) < h_{x_r,\mu_r}(0) \qquad (6.18) \]

for all $r$ sufficiently large.

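Proof. Existence of a sequence of local minimizers $d_r$ of $h_{x_r,\mu_r}$ with the properties (6.17) follows from parts (a) and (b) of Theorem 6.5 when we set $d_r = d(x_r, \mu_r)$ and use $\mu_r > \bar\mu$. We now prove (6.18). From (6.17) and Lemma 6.7, we deduce the existence of $v_r$ satisfying (6.13). If we were to have $d_r = 0$, then these conditions reduce to

\[ \nabla c(x_r)^* v_r = 0, \quad v_r \in \partial h(c(x_r)), \]

so that $0 \in \partial (h \circ c)(x_r)$, by subdifferential regularity of $h$. Hence we must have $d_r \ne 0$. By prox-regularity, we have

\[ h(c(x_r)) \ge h(c(x_r) + \nabla c(x_r) d_r) + \langle v_r, -\nabla c(x_r) d_r \rangle - \frac{\rho}{2} |\nabla c(x_r) d_r|^2 \]
\[ = h(c(x_r) + \nabla c(x_r) d_r) + \mu_r |d_r|^2 - \frac{\rho}{2} |\nabla c(x_r) d_r|^2 \]
\[ \ge h(c(x_r) + \nabla c(x_r) d_r) + \frac{\mu_r}{2} |d_r|^2 + \frac{\mu_r - \rho \|\nabla c(x_r)\|^2}{2} |d_r|^2 \]
\[ > h(c(x_r) + \nabla c(x_r) d_r) + \frac{\mu_r}{2} |d_r|^2, \]

where the final inequality holds because $\bar\mu > \rho \|\nabla c(\bar x)\|^2$. Since the left-hand side is $h_{x_r,\mu_r}(0)$ and the right-hand side is $h_{x_r,\mu_r}(d_r)$, this is (6.18). □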

Returning to the assumptions of Theorem 6.5, but now with the constraint qualification (6.11) replacing the weaker transversality condition (6.3), we can derive local uniqueness results about critical points for the proximal linearized subproblem. When the outer function $h$ is convex, uniqueness is obvious, since then the proximal linearized objective $h_{x,\mu}$ is strictly convex for any $\mu > 0$. For lower $C^2$ functions, the argument is much the same: such functions have the form $g - \kappa |\cdot|^2$ locally, for some continuous convex function $g$, so again $h_{x,\mu}$ is locally strictly convex for large $\mu$. For general prox-regular functions, the argument requires slightly more care.

Theorem 6.9 (unique step). Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map $c : \mathbb{R}^n \to \mathbb{R}^m$. Suppose that $c$ is $C^2$ around the point $\bar x \in \mathbb{R}^n$, that $h$ is prox-regular at the point $\bar c = c(\bar x)$, and that the composite function $h \circ c$ is critical at $\bar x$. Suppose furthermore that the constraint qualification (6.11) holds. Then there exists $\hat\mu \ge 0$ such that the following properties hold. Given any sequence $\{\mu_r\}$ with $\mu_r > \hat\mu$ for all $r$ and any sequence $x_r \to \bar x$ such that $\mu_r |x_r - \bar x| \to 0$, there exists a sequence of local minimizers $d_r$ of $h_{x_r,\mu_r}$ and a corresponding sequence of multipliers $v_r$ with the following properties:

\[ 0 \in \partial h_{x_r,\mu_r}(d_r), \quad d_r = O(|x_r - \bar x|), \quad \text{and} \quad h(c(x_r) + \nabla c(x_r) d_r) \to h(\bar c), \qquad (6.19) \]

as $r \to \infty$, and satisfying (6.13), with $v_r \to \bar v$, where $\bar v$ is the unique vector that solves the criticality condition (1.3). Moreover, $d_r$ is uniquely defined for all $r$ sufficiently large.

In the case of a convex, lower semicontinuous function $h : \mathbb{R}^m \to (-\infty, +\infty]$, the result holds with $\hat\mu = 0$.

Proof. The existence of sequences $\{d_r\}$ and $\{v_r\}$ with the claimed properties follows from Theorem 6.5 and Lemma 6.7. We need only prove the claim about uniqueness of the vectors $d_r$, and the final claim about the special case of $h$ convex and lower semicontinuous.

Throughout the proof, we choose $\hat\mu > \bar\mu$, where $\bar\mu$ is defined in Theorem 6.5.

We first show the uniqueness of $d_r$ in the general case. Since the function $h$ is prox-regular at $c(\bar x)$, its subdifferential $\partial h$ has a hypomonotone localization around the point $(c(\bar x), \bar v)$. In other words, there exist constants $\rho > 0$ and $\epsilon > 0$ such that the mapping $T : \mathbb{R}^m \rightrightarrows \mathbb{R}^m$ defined by

\[ T(y) = \begin{cases} \partial h(y) \cap B_\epsilon(\bar v) & (y \in B_\epsilon(c(\bar x)),\ |h(y) - h(c(\bar x))| \le \epsilon) \\ \emptyset & (\text{otherwise}) \end{cases} \]

has the property

\[ z \in T(y) \text{ and } z' \in T(y') \ \Longrightarrow\ \langle z' - z, y' - y \rangle \ge -\rho |y' - y|^2. \]

(See [34, Example 12.28 and Theorem 13.36].) If the uniqueness claim does not hold, we have by taking a subsequence if necessary that there is a sequence $x_r \to \bar x$ and distinct sequences $d_r^1 \ne d_r^2$ in $\mathbb{R}^n$ satisfying the conditions

\[ 0 \in \partial h_{x_r,\mu_r}(d_r^i), \quad d_r^i = O(|x_r - \bar x|) \to 0, \quad \text{and} \quad h(c(x_r) + \nabla c(x_r) d_r^i) \to h(c(\bar x)), \]

as $r \to \infty$, for $i = 1, 2$. Lemma 6.7 shows the existence of sequences of vectors $v_r^i \in \mathbb{R}^m$ satisfying

\[ 0 = \nabla c(x_r)^* v_r^i + \mu_r d_r^i, \]
\[ v_r^i \in \partial h(c(x_r) + \nabla c(x_r) d_r^i), \]

for all large $r$, and furthermore $v_r^i \to \bar v$ for each $i = 1, 2$. Consequently, for all large $r$ we have

\[ v_r^i \in T(c(x_r) + \nabla c(x_r) d_r^i) \quad \text{for } i = 1, 2, \]

so that

\[ -\mu_r |d_r^1 - d_r^2|^2 = \langle v_r^1 - v_r^2, \nabla c(x_r)(d_r^1 - d_r^2) \rangle \ge -\rho |\nabla c(x_r)(d_r^1 - d_r^2)|^2. \]

Since $\hat\mu > \bar\mu > \rho \|\nabla c(\bar x)\|^2$, we have the contradiction

\[ \rho \|\nabla c(x_r)\|^2 \ge \mu_r > \hat\mu > \rho \|\nabla c(\bar x)\|^2 \quad \text{for all large } r. \]

For the special case of $h$ convex and lower semicontinuous, we have from Theorem 6.5(c) that $d_r$ with the properties (6.19) exists, for $\hat\mu = 0$. Uniqueness of $d_r$ follows from strict convexity of $h_{x_r,\mu_r}$. Validity of the chain rule, which is needed to obtain (6.13), follows as in the proof of Lemma 6.7. □

6.5. Manifold Identification. We next work toward the identification result. Consider a sequence of points $\{x_r\}$ in $\mathbb{R}^n$ converging to the critical point $\bar x$ of the composite function $h \circ c$, and let $\mu_r$ be a sequence of positive proximality parameters. Suppose now that the outer function $h$ is partly smooth at the point $\bar c = c(\bar x) \in \mathbb{R}^m$ relative to some manifold $\mathcal{M} \subset \mathbb{R}^m$. Our aim is to find conditions guaranteeing that the update to the point $c(x_r)$ predicted by minimizing the proximal linearized objective $h_{x_r,\mu_r}$ lies on $\mathcal{M}$: in other words,

\[ c(x_r) + \nabla c(x_r) d(x_r, \mu_r) \in \mathcal{M} \quad \text{for all large } r, \]

where $d(x_r, \mu_r)$ is the unique small critical point of $h_{x_r,\mu_r}$. We would furthermore like to ensure that the "efficient projection" $x_{\rm new}$ resulting from this prediction, guaranteed by Theorem 6.6 (linear estimator improvement), satisfies $c(x_{\rm new}) \in \mathcal{M}$.

To illustrate, we return to our ongoing example from Section 3, the case in which the outer function $h$ is finite and polyhedral,

\[ h(c) = \max_{i \in I} \{ \langle h_i, c \rangle + \beta_i \}, \]

for some given vectors $h_i \in \mathbb{R}^m$ and scalars $\beta_i$ (see (3.1)). If $\bar I$ is the active index set corresponding to the point $\bar c$, then it is easy to check that $h$ is partly smooth relative to the manifold

\[ \mathcal{M} = \{ c : \langle h_i, c \rangle + \beta_i = \langle h_j, c \rangle + \beta_j \text{ for all } i, j \in \bar I \}. \]
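
In the polyhedral case, membership in the manifold is a finite check: the affine pieces that are active at $\bar c$ must all take the same value at the candidate point. The following sketch computes the active index set and tests membership. The function names, the tolerance, and the sample data are illustrative assumptions.

```python
def piece_values(c, hs, betas):
    """Values <h_i, c> + beta_i of the affine pieces at the point c."""
    return [sum(hk * ck for hk, ck in zip(h, c)) + b for h, b in zip(hs, betas)]

def active_set(c, hs, betas, tol=1e-9):
    """Indices i whose piece attains the maximum at c (within tol)."""
    vals = piece_values(c, hs, betas)
    top = max(vals)
    return [i for i, v in enumerate(vals) if top - v <= tol]

def on_manifold(c, hs, betas, active, tol=1e-9):
    """True if all pieces indexed by `active` agree at c (within tol)."""
    vals = piece_values(c, hs, betas)
    chosen = [vals[i] for i in active]
    return max(chosen) - min(chosen) <= tol

# Hypothetical data: h(c) = max{c_1, c_2, -1} on R^2, with reference point (0, 0).
hs = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
betas = [0.0, 0.0, -1.0]
active = active_set((0.0, 0.0), hs, betas)
```

Here the first two pieces are active at the origin, so the manifold is the line $c_1 = c_2$: the point $(0.3, 0.3)$ lies on it and $(0.3, 0.2)$ does not.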

Our analysis requires one more assumption, in addition to those of Theorem 16.91 
The basic criticality condition (|1.3p requires the existence of a multiplier vector: 

dh{c)nmi\{Wc{x)*) ^ 0. 

We now strengthen this assumption slightly, to a "strict" criticality condition: 

ri(a/i(c)) nNuU(Vc(x)*) 7^ 0, (6.20) 

where ri denotes the relative interior of a convex set. This condition is related to 
the strict complementarity assumption in nonlinear programming. For h defined as 
above, since dh{c) = conv{/ii : i e /}, we have 

ii{dh{c)) - {^A^/i^:^A, = 1, A>0}. 

ie/ is/ 
Hence, the strict criticality condition (6.20) becomes the existence of a vector $\lambda \in \mathbb{R}^{\bar I}$
satisfying

$\lambda > 0$ and $\sum_{i \in \bar I} \lambda_i \begin{bmatrix} \nabla c(\bar x)^* h_i \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. (6.21)

The only change from the corresponding basic criticality condition (3.2) is that the
condition $\lambda \ge 0$ has been strengthened to $\lambda > 0$, corresponding exactly to the
extra requirement of strict complementarity in the nonlinear programming formulation.
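
For the polyhedral case, condition (6.21) is a small linear system in $\lambda$, so it can be tested numerically. The sketch below is illustrative only: it assumes the Jacobian $\nabla c(\bar x)$ is given as an $m \times n$ array `J`, that the active vectors $h_i$ are the rows of `H_active`, and that the multiplier is unique, so that a least-squares solve recovers it. The function name and tolerance are hypothetical.

```python
import numpy as np

def strict_criticality(J, H_active, tol=1e-9):
    """Solve sum_i lam_i [J^T h_i; 1] = [0; 1] and classify the multiplier.

    Assumes the multiplier is unique; otherwise the least-squares solution
    need not be the one that certifies (strict) criticality.
    """
    n = J.shape[1]
    A = np.vstack([J.T @ H_active.T, np.ones((1, H_active.shape[0]))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    lam, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    solves = np.linalg.norm(A @ lam - rhs) <= tol
    is_critical = bool(solves and np.all(lam >= -tol))
    is_strict = bool(is_critical and np.all(lam > tol))
    return lam, is_critical, is_strict

J = np.array([[1.0]])                      # c(x) = x, so the Jacobian is 1
# h(c) = max(c, -c) = |c| at c_bar = 0: multiplier (1/2, 1/2), strict.
lam_abs, crit_abs, strict_abs = strict_criticality(J, np.array([[1.0], [-1.0]]))
# h(c) = max(c, 0) at c_bar = 0: multiplier (0, 1), critical but not strict.
lam_pos, crit_pos, strict_pos = strict_criticality(J, np.array([[1.0], [0.0]]))
```

The second example shows the gap between the basic and strict conditions: a zero component of $\lambda$ is allowed by (3.2) but excluded by (6.21).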

Recall that the constraint qualification (6.11) implies the uniqueness of the
multiplier vector $\bar v$, by Lemma 6.7. Assuming in addition the strict criticality condition
(6.20), we then have

$\bar v \in \mathrm{ri}(\partial h(\bar c)) \cap \mathrm{Null}(\nabla c(\bar x)^*)$.

We use the following result from Hare and Lewis, establishing a relationship
between partial smoothness of functions and sets.

Theorem 6.10. (Hare and Lewis, Theorem 5.1) A function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ is partly smooth at
a point $\bar c \in \mathbb{R}^m$ relative to a manifold $\mathcal{M} \subset \mathbb{R}^m$ if and only if the restriction $h|_{\mathcal{M}}$ is
$C^2$ around $\bar c$ and the epigraph $\mathrm{epi}\, h$ is partly smooth at the point $(\bar c, h(\bar c))$ relative to
the manifold $\{(c, h(c)) : c \in \mathcal{M}\}$.

We now prove a trivial modification of Theorem 5.3 of Hare and Lewis.

Theorem 6.11. Suppose the function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ is partly smooth at the point
$\bar c \in \mathbb{R}^m$ relative to the manifold $\mathcal{M} \subset \mathbb{R}^m$, and is prox-regular there. Consider a
subgradient $\bar v \in \mathrm{ri}\, \partial h(\bar c)$. Suppose the sequence $\{c_r\} \subset \mathbb{R}^m$ satisfies $c_r \to \bar c$ and
$h(c_r) \to h(\bar c)$. Then $c_r \in \mathcal{M}$ for all large $r$ if and only if $\mathrm{dist}(\bar v, \partial h(c_r)) \to 0$.

Proof. The proof proceeds exactly as in Theorem 5.3 of Hare and Lewis, except that instead of
defining a function $g : \mathbb{R}^m \times \mathbb{R} \to \mathbb{R}$ by $g(c, r) = r$, we set $g(c, r) = r - c^T \bar v$. □
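
As a hedged one-dimensional illustration of this criterion (not taken from the paper), consider $h = |\cdot|$ on $\mathbb{R}$, which is partly smooth at $\bar c = 0$ relative to $\mathcal{M} = \{0\}$, with $\mathrm{ri}\, \partial h(0) = (-1, 1)$. The helper name below is hypothetical.

```python
def dist_to_subdiff_abs(v, c):
    """Distance from v to the subdifferential of |.| at the point c."""
    if c > 0:
        return abs(v - 1.0)          # subdifferential is {1}
    if c < 0:
        return abs(v + 1.0)          # subdifferential is {-1}
    return max(abs(v) - 1.0, 0.0)    # subdifferential at 0 is [-1, 1]

v_bar = 0.3                          # any point of the relative interior (-1, 1)
off_manifold = dist_to_subdiff_abs(v_bar, 1e-9)   # stays near 0.7 as c -> 0+
on_manifold = dist_to_subdiff_abs(v_bar, 0.0)     # exactly 0 on the manifold
```

For a sequence $c_r \to 0$ with $c_r \neq 0$, the distance stays bounded away from zero, which matches the "only if" direction of the theorem.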

We can now prove our main identification result. 

Theorem 6.12. Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$, and a map $c : \mathbb{R}^n \to \mathbb{R}^m$ that
is $C^2$ around the point $\bar x \in \mathbb{R}^n$. Suppose that $h$ is prox-regular at the point $\bar c = c(\bar x)$,
and partly smooth there relative to the manifold $\mathcal{M}$. Suppose furthermore that the
constraint qualification (6.11) and the strict criticality condition (6.20) both hold for
the composite function $h \circ c$ at $\bar x$. Then there exist constants $\bar\mu, \gamma \ge 0$ with the following
property. Given any sequence $\{\mu_r\}$ with $\mu_r > \bar\mu$ for all $r$, and any sequence $x_r \to \bar x$
such that $\mu_r |x_r - \bar x| \to 0$, the local minimizer $d_r$ of $h_{x_r,\mu_r}$ defined in Theorem 6.9
satisfies, for all large $r$, the condition

$c(x_r) + \nabla c(x_r) d_r \in \mathcal{M}$, (6.22)

and also the inequalities

$|x_r^{\rm new} - (x_r + d_r)| \le \gamma |d_r|^2$ and $h(c(x_r^{\rm new})) \le h(c(x_r) + \nabla c(x_r) d_r) + \gamma |d_r|^2$, (6.23)

hold for some point $x_r^{\rm new}$ with $c(x_r^{\rm new}) \in \mathcal{M}$.

In the special case when $h : \mathbb{R}^m \to (-\infty, +\infty]$ is a convex and lower semicontinuous
function, the result holds with $\bar\mu = 0$.

Proof. Theorem 6.9 implies $d_r \to 0$, so

$c_r = c(x_r) + \nabla c(x_r) d_r \to \bar c$.

The theorem also shows $h(c_r) \to h(\bar c)$, and furthermore that there exist multiplier
vectors $v_r \in \partial h(c_r)$ satisfying

$v_r \to \bar v \in \mathrm{ri}\, \partial h(\bar c)$.

Since

$\mathrm{dist}(\bar v, \partial h(c_r)) \le |\bar v - v_r| \to 0$,

we can apply Theorem 6.11 to obtain property (6.22).

Let us now define a function $h_{\mathcal{M}} : \mathbb{R}^m \to \bar{\mathbb{R}}$, agreeing with $h$ on the manifold
$\mathcal{M}$ and taking the value $+\infty$ elsewhere. By partial smoothness, $h_{\mathcal{M}}$ is the sum of
a smooth function and the indicator function of $\mathcal{M}$, and hence $\partial^\infty h_{\mathcal{M}}(\bar c) = N_{\mathcal{M}}(\bar c)$.
Partial smoothness also implies $\mathrm{par}(\partial h(\bar c)) = N_{\mathcal{M}}(\bar c)$. We can therefore rewrite the
constraint qualification (6.11) in the form

$\partial^\infty h_{\mathcal{M}}(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*) = \{0\}$.

This condition allows us to apply Theorem 6.6 (linear estimator improvement), with
the function $h_{\mathcal{M}}$ replacing the function $h$, to deduce the existence of the point $x_r^{\rm new}$,
as required. □
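
The role of the condition that the product of the proximality parameter and the distance to the critical point tends to zero can be seen in a one-dimensional sketch. This is an illustration under simplifying assumptions, not part of the paper: take $h = |\cdot|$ and $c(x) = x$ on $\mathbb{R}$, with critical point $0$ and manifold $\{0\}$. The subproblem $\min_d |x + d| + (\mu/2) d^2$ then has a closed-form solution by soft-thresholding, and the predicted point $x + d$ lies on the manifold exactly when $\mu |x| \le 1$. The function names are hypothetical.

```python
import math

def predicted_point(x, mu):
    """Return x + d, where d minimizes |x + d| + (mu/2) d^2."""
    # Soft-thresholding of x with threshold 1/mu.
    return math.copysign(max(abs(x) - 1.0 / mu, 0.0), x)

def identifies_manifold(x, mu):
    """True if the predicted point lies on the manifold {0}."""
    return predicted_point(x, mu) == 0.0

r = 100
slow_growth = identifies_manifold(1.0 / r, math.sqrt(r))   # mu * |x| = 0.1
fast_growth = identifies_manifold(1.0 / r, float(r) ** 2)  # mu * |x| = 100
```

With slowly growing parameters the prediction lands on the manifold; when the parameter grows too fast relative to the distance, the step is too short to reach it.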

7. A Proximal Algorithm and its Properties. We now describe a simple
first-order algorithm that manipulates the proximality parameter $\mu$ in (4.1) to achieve
a "sufficient decrease" in $h$ at each iteration. We follow up with some results concerning
the global convergence behavior of this method and its ability to identify the
manifold $\mathcal{M}$ of Section 6.5.

Algorithm ProxDescent

Define constants $\tau > 1$, $\sigma \in (0, 1)$, and $\mu_{\min} > 0$;
Choose $x_0$, $\mu_0 \ge \mu_{\min}$;
Set $\mu \leftarrow \mu_0$;
for $k = 0, 1, 2, \ldots$
    Set accept $\leftarrow$ false;
    while not accept
        Find a local minimizer $d_k$ of (4.1) with $x = x_k$
            such that $h_{x_k,\mu}(d_k) < h_{x_k,\mu}(0)$;
        if no such $d_k$ exists
            terminate with $\bar x = x_k$;
        end (if)
        Derive $x_k^+$ from $x_k + d_k$ (by an efficient projection and/or
            other enhancements);
        if $h(c(x_k)) - h(c(x_k^+)) \ge \sigma [h(c(x_k)) - h(c(x_k) + \nabla c(x_k) d_k)]$
            and $|x_k^+ - (x_k + d_k)| \le \frac{1}{2} |d_k|$
            $x_{k+1} \leftarrow x_k^+$;
            $\mu_k \leftarrow \mu$;
            $\mu \leftarrow \max(\mu_{\min}, \mu/\tau)$;
            accept $\leftarrow$ true;
        else
            $\mu \leftarrow \tau \mu$;
        end (if)
    end (while)
end (for).
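
The following is a minimal runnable sketch of this framework in a deliberately simple special case, offered as an illustration rather than as the authors' implementation. It assumes $m = 1$ and $h = |\cdot|$, so that the proximal linearized subproblem has a closed-form solution by soft-thresholding, and it takes $x_k^+ = x_k + d_k$ with no projection, so the second acceptance test holds trivially. The names `prox_linear_step` and `prox_descent`, the stopping tolerance `tol`, and the iteration cap `max_iter` are hypothetical additions.

```python
import numpy as np

def prox_linear_step(c_val, grad, mu):
    """Minimize |c_val + grad.d| + (mu/2)|d|^2 over d in closed form."""
    g2 = float(grad @ grad)
    if g2 == 0.0:
        return np.zeros_like(grad)
    # Optimal value u of the linearization c_val + grad.d (soft-thresholding).
    u = np.sign(c_val) * max(abs(c_val) - g2 / mu, 0.0)
    return (u - c_val) / g2 * grad

def prox_descent(c, grad_c, x0, mu0=1.0, mu_min=1e-3, tau=2.0, sigma=0.1,
                 tol=1e-12, max_iter=100):
    x = np.asarray(x0, dtype=float)
    mu = max(mu0, mu_min)
    for _ in range(max_iter):
        accept = False
        while not accept:
            cx, g = c(x), grad_c(x)
            d = prox_linear_step(cx, g, mu)
            lin = abs(cx + g @ d)                    # h at the linearized point
            model_decrease = abs(cx) - (lin + 0.5 * mu * (d @ d))
            if model_decrease <= tol:                # no useful descent step
                return x
            x_plus = x + d                           # assumption: no projection
            if abs(cx) - abs(c(x_plus)) >= sigma * (abs(cx) - lin):
                x, mu, accept = x_plus, max(mu_min, mu / tau), True
            else:
                mu *= tau                            # reject: increase mu
    return x

# Example: minimize | |x|^2 - 1 | starting from (2, 1).
c = lambda x: float(x @ x - 1.0)
grad_c = lambda x: 2.0 * x
x_final = prox_descent(c, grad_c, [2.0, 1.0])
```

In this example the iterates approach the unit circle, where $h \circ c$ attains its minimum value of zero.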
We are not overly specific about the derivation of $x_k^+$ from $x_k + d_k$, but we
assume that the "efficient projection" technique that is the basis of Theorem 6.6 is
used when possible. Lemma 6.8 indicates that for $\mu$ sufficiently large and $x$ near a
critical point $\bar x$ of $h \circ c$, it is indeed possible to find a local solution $d$ of (4.1) which
satisfies $h_{x,\mu}(d) < h_{x,\mu}(0)$ as required by the algorithm, and which also satisfies the
conditions of Theorem 6.6. Lemma 7.2 below shows further that the new point
satisfies the acceptance tests in the algorithm. However, Lemma 7.2 is more general in
that it also gives conditions for acceptance of the step when $x_k$ is not in a neighborhood
of a critical point of $h \circ c$.

The framework also allows $x_k^+$ to be improved further. For example, we could use
higher-order derivatives of $c$ to take a further step along the manifold of $h$ identified
by the subproblem (4.1), analogous to an "EQP step" in nonlinear programming, and
reset $x_k^+$ accordingly if this step produces a reduction in $h \circ c$. We discuss this point
further at the end of the section.

The main result in this section, Theorem 7.4, specifies conditions under which
Algorithm ProxDescent does not have nonstationary accumulation points. We start
with a technical result showing that, in the neighborhood of a non-critical point $\bar x$ and for
bounded $\mu$, the steps $d$ do not become too short.

Lemma 7.1. Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map $c : \mathbb{R}^n \to \mathbb{R}^m$. Let $\bar x$ be
such that: $c$ is $C^2$ near $\bar x$; $h$ is finite at the point $\bar c = c(\bar x)$ and subdifferentially regular
there; the transversality condition (6.3) holds; but the criticality condition (1.3) is not
satisfied. Then given any constant $\mu_{\max} > 0$, there exists a quantity $\epsilon > 0$ such that
for any sequence $x_r \to \bar x$ with $h(c(x_r)) \to h(c(\bar x))$, and any sequence $\mu_r \in [0, \mu_{\max}]$,
any sequence of critical points $d_r$ of $h_{x_r,\mu_r}$ satisfying $h_{x_r,\mu_r}(d_r) < h_{x_r,\mu_r}(0)$ must also
satisfy $\liminf_r |d_r| > \epsilon$.

Proof. If the result failed, there would exist sequences $x_r$, $\mu_r$, and $d_r$ as above
except that $d_r \to 0$. Noting that $h(c(x_r) + \nabla c(x_r) d_r) \to h(c(\bar x))$ (using lower semicontinuity
and the fact that the left-hand side is dominated by $h(c(x_r))$, which converges
to $h(\bar c)$), we have that

$\partial^\infty h(c(x_r) + \nabla c(x_r) d_r) \cap \mathrm{Null}(\nabla c(x_r)^*) = \{0\}$,

for all $r$ sufficiently large. (If this were not true, we could use an outer semicontinuity
argument based on [34, Theorem 8.7] to deduce that $\partial^\infty h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*)$
contains a nonzero vector, thus violating the transversality condition (6.3).) Hence, we can apply the
chain rule and deduce that there are multiplier vectors $v_r$ such that (6.13) is satisfied,
that is,

$0 = \nabla c(x_r)^* v_r + \mu_r d_r$,
$v_r \in \partial h(c(x_r) + \nabla c(x_r) d_r)$,

for all sufficiently large $r$. If the sequence $\{v_r\}$ is unbounded, we can assume without
loss of generality that $|v_r| \to \infty$. Any limit point of the sequence $v_r/|v_r|$ would
be a unit vector in the set $\partial^\infty h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*)$, contradicting (6.3). Hence, the
sequence $\{v_r\}$ is bounded, so by taking limits in the conditions above and using
$\mu_r d_r \to 0$ and outer semicontinuity of $\partial h$ at $\bar c$, we can identify a vector $\bar v$ such that
$\bar v \in \partial h(\bar c) \cap \mathrm{Null}(\nabla c(\bar x)^*)$. Using the chain rule and subdifferential regularity, this
contradicts non-criticality of $\bar x$. □

The next result makes use of the efficient projection mechanism of Theorem 6.6.
When the conditions of this theorem are satisfied, we show that Algorithm ProxDescent
can perform the projection to obtain the point $x_k^+$ in such a way that (6.10)
is satisfied. We thus have the following result.

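Lemma 7.2. Consider the constant $\sigma \in (0, 1)$, a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$, and a map
$c : \mathbb{R}^n \to \mathbb{R}^m$ that is $C^2$ around a point $\bar x \in \mathbb{R}^n$. Assume that $h$ is lower semicontinuous
and finite at $\bar c = c(\bar x)$ and that the transversality condition (6.3) holds at $\bar x$. Then there
exist constants $\bar\mu > 0$ and $\bar\delta > 0$ with the following property: For any $x \in B_{\bar\delta}(\bar x)$,
$d \in B_{\bar\delta}(0)$, and $\mu \ge \bar\mu$ such that

$h_{x,\mu}(d) \le h_{x,\mu}(0)$, $|h(c(x) + \nabla c(x) d) - h(c(\bar x))| \le \bar\delta$, (7.1)

there is a point $x^+ \in \mathbb{R}^n$ such that

$h(c(x)) - h(c(x^+)) \ge \sigma [h(c(x)) - h(c(x) + \nabla c(x) d)]$, (7.2a)
$|x^+ - (x + d)| \le \frac{1}{2} |d|$. (7.2b)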



Proof. Define $\delta$ and $\gamma$ as in Theorem 6.6 and set $\bar\delta = \min(\delta, 1/(2\gamma))$. By applying
Theorem 6.6, we obtain a point $x^+$ (denoted by $x_{\rm new}$ in the earlier result) for which
$|x^+ - (x + d)| \le \gamma |d|^2 \le \frac{1}{2} |d|$ (thus satisfying (7.2b)) and
$h(c(x^+)) \le h(c(x) + \nabla c(x) d) + \gamma |d|^2$. Also note that because of $h_{x,\mu}(d) \le h_{x,\mu}(0)$, we have

$h(c(x) + \nabla c(x) d) + \frac{\mu}{2} |d|^2 \le h(c(x))$

and hence

$|d|^2 \le \frac{2}{\mu} [h(c(x)) - h(c(x) + \nabla c(x) d)]$.

We therefore have

$h(c(x)) - h(c(x^+)) \ge h(c(x)) - h(c(x) + \nabla c(x) d) - \gamma |d|^2$
$\qquad \ge [h(c(x)) - h(c(x) + \nabla c(x) d)] \left(1 - \frac{2\gamma}{\mu}\right)$.

By choosing $\bar\mu$ large enough that $1 - 2\gamma/\bar\mu > \sigma$, we obtain (7.2a). □
We also need the following elementary lemma. 

Lemma 7.3. For any constants $\tau > 1$ and $\rho > 0$ and any positive integer $t$, we
have

$\min \Big\{ \sum_{i=1}^{t} \alpha_i^2 \tau^i : \sum_{i=1}^{t} \alpha_i \ge \rho,\ \alpha \in \mathbb{R}^t_+ \Big\} > \rho^2 (\tau - 1)$.

Proof. By scaling, we can suppose $\rho = 1$. Clearly the optimal solution of this
problem must lie on the hyperplane $H = \{\alpha : \sum_i \alpha_i = 1\}$. The objective function is
convex, and its gradient at the point $\bar\alpha \in H$ defined by

$\bar\alpha_i = \tau^{-i} \, \dfrac{\tau - 1}{1 - \tau^{-t}} > 0$

is easily checked to be orthogonal to $H$. Hence $\bar\alpha$ is optimal, and the corresponding
optimal value, $(\tau - 1)/(1 - \tau^{-t})$, is easily checked to be strictly larger than $\tau - 1$. □
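
The claims in this proof are easy to check numerically. The sketch below (function name hypothetical) builds the minimizer $\bar\alpha_i = \rho\, \tau^{-i} (\tau - 1)/(1 - \tau^{-t})$ obtained from the Lagrangian stationarity condition $2 \alpha_i \tau^i = \text{const}$, and evaluates the objective there; for $\rho = 1$ the value is $(\tau - 1)/(1 - \tau^{-t})$.

```python
import numpy as np

def lemma_7_3_minimizer(tau, t, rho=1.0):
    """Minimizer and optimal value of sum_i alpha_i^2 tau^i s.t. sum_i alpha_i = rho."""
    i = np.arange(1, t + 1, dtype=float)
    alpha = rho * tau ** (-i) * (tau - 1.0) / (1.0 - tau ** (-float(t)))
    value = float(np.sum(alpha ** 2 * tau ** i))
    return alpha, value

alpha_bar, opt_value = lemma_7_3_minimizer(2.0, 3)
# Here alpha_bar = (4/7, 2/7, 1/7) and opt_value = 8/7, which exceeds tau - 1 = 1.
```

As $t$ grows, the optimal value decreases toward $\tau - 1$ but never reaches it, which is why the bound in the lemma is strict and independent of $t$.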
In the following result, the assumptions on $h$, $c$, and $\bar x$ allow us to apply both
Lemmas 7.1 and 7.2.

Theorem 7.4. Consider a constant $\sigma \in (0, 1)$, a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$ and a map
$c : \mathbb{R}^n \to \mathbb{R}^m$. Let the point $\bar x \in \mathbb{R}^n$ be such that $c$ is $C^2$ near $\bar x$; $h$ is subdifferentially
regular at the point $\bar c = c(\bar x)$; the transversality condition (6.3) holds; and the criticality
condition (1.3) is not satisfied. Then the pair $(\bar x, h(\bar c))$ cannot be an accumulation
point of the sequence $(x_k, h(c(x_k)))$ generated by Algorithm ProxDescent.

Proof. Suppose for contradiction that $(\bar x, h(\bar c))$ is an accumulation point. Since
the sequence $\{h(c(x_r))\}$ generated by the algorithm is monotonically decreasing, we
have $h(c(x_r)) \downarrow h(\bar c)$. By the acceptance test in the algorithm and the definition of
$h_{x,\mu}$ in (4.1), we have that

$h(c(x_{r+1})) \le h(c(x_r)) - \sigma [h(c(x_r)) - h(c(x_r) + \nabla c(x_r) d_r)]$
$\qquad \le h(c(x_r)) - \frac{\sigma \mu_r}{2} |d_r|^2$. (7.3)

Using this inequality, we have

$h(c(x_0)) - h(c(\bar x)) \ge \sum_{r=0}^{\infty} [h(c(x_r)) - h(c(x_{r+1}))]$
$\qquad \ge \frac{\sigma}{2} \sum_{r=0}^{\infty} \mu_r |d_r|^2$
$\qquad \ge \frac{\sigma \mu_{\min}}{2} \sum_{r=0}^{\infty} |d_r|^2$,

which implies that $d_r \to 0$. Further, we have that

$|h(c(x_r) + \nabla c(x_r) d_r) - h(\bar c)|$
$\qquad \le [h(c(x_r)) - h(c(x_r) + \nabla c(x_r) d_r)] + [h(c(x_r)) - h(\bar c)]$
$\qquad \le \sigma^{-1} [h(c(x_r)) - h(c(x_{r+1}))] + [h(c(x_r)) - h(\bar c)] \to 0$. (7.4)

Because $\bar x$ is an accumulation point, we can define a subsequence of indices $r_j$, $j =
0, 1, 2, \ldots$ such that $\lim_{j \to \infty} x_{r_j} = \bar x$. The corresponding sequence of regularization
parameters $\mu_{r_j}$ must be unbounded, since otherwise we could set $\mu_{\max}$ in Lemma 7.1
to be an upper bound on $\mu_{r_j}$, and deduce that the sequence $\{d_{r_j}\}$ is bounded away
from zero, which contradicts $d_r \to 0$. Defining $\bar\mu$ and $\bar\delta$ as in Lemma 7.2, we can
assume without loss of generality that $\mu_{r_j} > \tau \bar\mu$ and $\mu_{r_{j+1}} > \mu_{r_j}$ for all $j$. Moreover,
since $x_{r_j} \to \bar x$ and $d_{r_j} \to 0$, and using (7.4), we can assume that

$x_{r_j} \in B_{\bar\delta/2}(\bar x)$, for $j = 0, 1, 2, \ldots$, (7.5a)
$d_r \in B_{\bar\delta}(0)$, for all $r \ge r_0$, (7.5b)
$|h(c(x_r) + \nabla c(x_r) d_r) - h(\bar c)| \le \bar\delta$, for all $r \ge r_0$. (7.5c)

The value of $\mu$ cannot be increased in the inner iteration of Algorithm ProxDescent
at iteration $r_j$. We verify this claim by noting that because of (7.5), Lemma 7.2 tells
us that the previously tried value of $\mu$, namely $\mu_{r_j}/\tau \ge \bar\mu$, would have been accepted
by the algorithm had it tried to increase $\mu$ during iteration $r_j$. We define $k_j$ to be
the latest iteration prior to $r_{j+1}$ at which $\mu$ was increased, in the inner iteration of
Algorithm ProxDescent. Note that such an iteration $k_j$ exists, because $\mu_{r_{j+1}} > \mu_{r_j}$, so
the value of $\mu$ must have been increased during some intervening iteration. Moreover,
we have $r_j < k_j < r_{j+1}$. Since no increases of $\mu$ were performed internally during
iterations $k_j + 1, \ldots, r_{j+1}$, the value of $\mu$ used at each of these iterations was the first
one tried, which was a factor $\tau^{-1}$ of the value from the previous iteration. That is,

$\tau \bar\mu < \mu_{r_{j+1}} = \tau^{-1} \mu_{r_{j+1}-1} = \tau^{-2} \mu_{r_{j+1}-2} = \cdots = \tau^{k_j - r_{j+1}} \mu_{k_j}$. (7.6)

Since the previous value of $\mu$ tried at iteration $k_j$, namely $\mu_{k_j}/\tau$, was rejected, we
can conclude from Lemma 7.2 that $|x_{k_j} - \bar x| > \bar\delta$. To see this, note that all the
other conditions of Lemma 7.2 are satisfied by this value of $\mu$, that is, $\mu_{k_j}/\tau \ge \bar\mu$,
$d_{k_j} \in B_{\bar\delta}(0)$ (because of (7.5b)), and $|h(c(x_{k_j}) + \nabla c(x_{k_j}) d_{k_j}) - h(\bar c)| \le \bar\delta$ (because of
(7.5c)). Recalling that $|x_{r_{j+1}} - \bar x| < \bar\delta/2$, and noting from the acceptance criteria in
Algorithm ProxDescent that $|x_{k+1} - x_k| \le |x_{k+1} - (x_k + d_k)| + |d_k| \le (3/2)|d_k|$, we
have that

$\frac{1}{2} \bar\delta < |x_{r_{j+1}} - x_{k_j}| \le \sum_{k=k_j}^{r_{j+1}-1} |x_{k+1} - x_k| \le \frac{3}{2} \sum_{k=k_j}^{r_{j+1}-1} |d_k|$. (7.7)

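To bound the decrease in objective function over the steps from $x_{k_j}$ to $x_{r_{j+1}}$, we
have from the acceptance condition and (7.6) that

$h(c(x_{k_j})) - h(c(x_{r_{j+1}})) = \sum_{k=k_j}^{r_{j+1}-1} [h(c(x_k)) - h(c(x_{k+1}))]$
$\qquad \ge \frac{\sigma}{2} \sum_{k=k_j}^{r_{j+1}-1} \mu_k |d_k|^2$
$\qquad = \frac{\sigma}{2} \mu_{r_{j+1}} \sum_{k=k_j}^{r_{j+1}-1} \tau^{r_{j+1}-k} |d_k|^2$.

To obtain a lower bound on the final summation, we apply Lemma 7.3 with $\rho = \bar\delta/3$
(from (7.7)) and $t = r_{j+1} - k_j \ge 1$ to obtain

$h(c(x_{k_j})) - h(c(x_{r_{j+1}})) \ge \frac{\sigma}{2} \mu_{r_{j+1}} \left(\frac{\bar\delta}{3}\right)^2 (\tau - 1) > \frac{\sigma}{2} \tau \bar\mu \left(\frac{\bar\delta}{3}\right)^2 (\tau - 1) > 0$,

where we have used $\mu_{r_{j+1}} > \tau \bar\mu$. Since this finite decrease happens for every index
$j = 1, 2, \ldots$, we obtain a contradiction from the usual telescoping sum argument. □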

To illustrate the idea of identification, we state a simple manifold identification 
result for the case when the function h is convex and finite. 

Theorem 7.5. Consider a function $h : \mathbb{R}^m \to \bar{\mathbb{R}}$, a map $c : \mathbb{R}^n \to \mathbb{R}^m$, and a
point $\bar x \in \mathbb{R}^n$ that is critical for $h \circ c$. Suppose that $c$ is $C^2$ near $\bar x$, and that $h$ is convex
and continuous on $\mathrm{dom}\, h$, near $\bar c := c(\bar x)$. Suppose in addition that $h$ is partly smooth
at $\bar c$ relative to the manifold $\mathcal{M}$. Finally, assume that the constraint qualification
(6.11) and the strict criticality condition (6.20) both hold for the composite function
$h \circ c$ at $\bar x$.

Then if Algorithm ProxDescent generates a sequence $x_r \to \bar x$, we have that
$c(x_r) + \nabla c(x_r) d_r \in \mathcal{M}$ for all $r$ sufficiently large.
Proof. Note that $h$, $c$, and $\bar x$ satisfy the assumptions of Theorem 6.12, with
$\bar\mu = 0$. To apply Theorem 6.12 and thus prove the result, we need to show only
that $\mu_r |x_r - \bar x| \to 0$. In fact, we show that $\{\mu_r\}$ is bounded, so that this estimate is
satisfied trivially.

Using Lemma 7.2, we have that the step acceptance condition of Algorithm ProxDescent
is satisfied at $x_r$ for all $\mu \ge \bar\mu$. It follows that for all $r$ sufficiently large, we
have in fact that $\mu_r \le \tau \bar\mu$, which leads to the desired result. □

To enhance the step $d$ obtained from (4.1), we might try to incorporate second-order
information inherent in the structure of the subdifferential $\partial h$ at the new
value of $c$ predicted by the linearized subproblem. Knowledge of the subdifferential
$\partial h(c_{\rm pred}(x))$ allows us in principle to compute the tangent space to $\mathcal{M}$ at $c_{\rm pred}(x)$.
We could then try to "track" $\mathcal{M}$ using second-order information, since both the map
$c$ and the restriction of the function $h$ to $\mathcal{M}$ are $C^2$.
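
For a finite polyhedral $h$, this tangent space computation can be sketched concretely. The following is an illustration under stated assumptions, not a procedure given in the paper: it uses the relation $\mathrm{par}(\partial h(c)) = N_{\mathcal{M}}(c)$ for partly smooth $h$, so that the tangent space is the orthogonal complement of the span of the differences of the active vectors $h_i$ that generate the subdifferential. The function name and tolerance are hypothetical.

```python
import numpy as np

def tangent_basis(H_active, tol=1e-10):
    """Orthonormal basis (as columns) of the tangent space to the manifold.

    H_active holds the active vectors h_i as rows; the normal space is
    spanned by the differences h_i - h_0, and the tangent space is its
    orthogonal complement.
    """
    m = H_active.shape[1]
    D = H_active[1:] - H_active[0]
    if D.shape[0] == 0:              # one active piece: h is smooth nearby
        return np.eye(m)
    _, s, Vt = np.linalg.svd(D)      # full SVD: Vt is m-by-m
    rank = int(np.sum(s > tol))
    return Vt[rank:].T               # right singular vectors of the null space

# Example: h(c) = max(c_1, -c_1) on R^2; the manifold is the line c_1 = 0.
H_act = np.array([[1.0, 0.0], [-1.0, 0.0]])
T = tangent_basis(H_act)
```

In the example, the single tangent direction is the $c_2$ axis, along which $h$ is smooth.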